Logistic regression is used to predict categorical outcomes. This document discusses logistic regression, including its objectives, assumptions, key terms, and an example application to predicting basketball match outcomes. Logistic regression uses maximum likelihood estimation to model the relationship between a binary dependent variable and one or more independent variables. The document walks through conducting logistic regression in SPSS to predict match results from variables like passes, rebounds, free throws, and blocks.
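To make the mechanics concrete outside SPSS, here is a minimal sketch in R; the data frame and its values are made up for illustration, not taken from the document.

```r
# Hypothetical match data: result is 1 for a win, 0 for a loss
matches <- data.frame(
  passes      = c(310, 250, 290, 220, 330, 305, 240, 260, 275, 295),
  rebounds    = c(42, 35, 40, 30, 45, 33, 38, 31, 36, 41),
  free_throws = c(18, 12, 16, 10, 20, 9, 15, 13, 11, 17),
  blocks      = c(5, 2, 4, 1, 6, 2, 3, 3, 2, 5),
  result      = c(1, 0, 1, 0, 1, 0, 1, 0, 0, 1)
)

# glm() with a binomial family fits logistic regression by maximum likelihood
fit <- glm(result ~ passes + rebounds + free_throws + blocks,
           data = matches, family = binomial)
summary(fit)                     # coefficients are on the log-odds scale
predict(fit, type = "response")  # predicted win probabilities
```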
Bayes' Theorem relates prior probabilities, conditional probabilities, and posterior probabilities. It provides a mathematical rule for updating estimates based on new evidence or observations. The theorem states that the posterior probability of an event is equal to the conditional probability of the event given the evidence multiplied by the prior probability, divided by the probability of the evidence. Bayes' Theorem can be used to calculate conditional probabilities, like the probability of a woman having breast cancer given a positive mammogram result, or the probability that a part came from a specific supplier given that it is non-defective. It is widely applicable in science, medicine, and other fields for revising hypotheses based on new data.
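To make the mammogram example concrete, here is a small R sketch of Bayes' Theorem; the prevalence, sensitivity, and false-positive rate are illustrative assumptions, not figures from the document.

```r
# Bayes' Theorem: P(cancer | positive) = P(positive | cancer) * P(cancer) / P(positive)
prior       <- 0.01    # assumed prevalence of breast cancer
sensitivity <- 0.80    # assumed P(positive | cancer)
false_pos   <- 0.096   # assumed P(positive | no cancer)

# Law of total probability: overall chance of a positive mammogram
p_positive <- sensitivity * prior + false_pos * (1 - prior)

# Posterior probability of cancer given a positive result (about 0.078)
posterior <- sensitivity * prior / p_positive
posterior
```

Even after a positive test, the posterior stays below 8% because the disease is rare; this is exactly the kind of updating on new evidence the theorem formalizes.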
The document discusses simple linear regression. It defines key terms like regression equation, regression line, slope, intercept, residuals, and residual plot. It provides examples of using sample data to generate a regression equation and evaluating that regression model. Specifically, it shows generating a regression equation from bivariate data, checking assumptions visually through scatter plots and residual plots, and interpreting the slope as the marginal change in the response variable from a one unit change in the explanatory variable.
The document provides an overview of statistics for data science. It introduces key concepts including descriptive versus inferential statistics, different types of variables and data, probability distributions, and statistical analysis methods. Descriptive statistics are used to describe data through measures of central tendency, variability, and visualization techniques. Inferential statistics enable drawing conclusions about populations from samples using hypothesis testing, confidence intervals, and regression analysis.
This chapter summary covers simple linear regression models. Key topics include determining the simple linear regression equation, measures of variation such as total, explained, and unexplained sums of squares, assumptions of the regression model including normality, homoscedasticity and independence of errors. Residual analysis is discussed to examine linearity and assumptions. The coefficient of determination, standard error of estimate, and Durbin-Watson statistic are also introduced.
- Simple linear regression is used to predict values of one variable (dependent variable) given known values of another variable (independent variable).
- A regression line is fitted through the data points to minimize the deviations between the observed and predicted dependent variable values. The equation of this line allows predicting dependent variable values for given independent variable values.
- The coefficient of determination (R2) indicates how much of the total variation in the dependent variable is explained by the regression line. The standard error of estimate provides a measure of how far the observed data points deviate from the regression line on average.
- Prediction intervals can be constructed around predicted dependent variable values to indicate the uncertainty in predictions for a given confidence level, based on the standard error of the estimate.
Linear Regression vs Logistic Regression | Edureka
YouTube: https://youtu.be/OCwZyYH14uw
** Data Science Certification using R: https://www.edureka.co/data-science **
This Edureka PPT on Linear Regression Vs Logistic Regression covers the basic concepts of linear and logistic models. The following topics are covered in this session:
Types of Machine Learning
Regression Vs Classification
What is Linear Regression?
What is Logistic Regression?
Linear Regression Use Case
Logistic Regression Use Case
Linear Regression Vs Logistic Regression
Blog Series: http://bit.ly/data-science-blogs
Data Science Training Playlist: http://bit.ly/data-science-playlist
Follow us to never miss an update.
YouTube: https://www.youtube.com/user/edurekaIN
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Missing data handling is typically done in an ad-hoc way. Without understanding the repercussions of a missing data handling technique, approaches that merely get you to the "next step" in your analytics pipeline lead to poor outputs, conclusions that aren't robust, and biased estimates. Handling missing data in data sets requires a structured approach. In this workshop, we will cover the key tenets of handling missing data in a structured way.
Delta Analytics is a 501(c)3 non-profit in the Bay Area. We believe that data is powerful, and that anybody should be able to harness it for change. Our teaching fellows partner with schools and organizations worldwide to work with students excited about the power of data to do good.
Welcome to the course! These modules will teach you the fundamental building blocks and the theory necessary to be a responsible machine learning practitioner in your own community. Each module focuses on accessible examples designed to teach you about good practices and the powerful (yet surprisingly simple) algorithms we use to model data.
To learn more about our mission or provide feedback, take a look at www.deltanalytics.org.
Scatter plots graph ordered pairs of data and can show positive, negative, or no correlation between two variables. A positive correlation means both variables increase together, while a negative correlation means one increases as the other decreases. The correlation coefficient measures the strength of the linear relationship on a scale from -1 to 1. An example scatter plot shows U.S. SUV sales increasing each year from 1991 to 1999, indicating a positive correlation between year and sales.
Polynomial regression models the relationship between variables as a polynomial equation rather than a linear one. It allows for modeling of curvilinear relationships. The document discusses the definition of polynomial regression, why it is used, its history, the regression model and matrix form, how to implement it in Matlab, its advantages in fitting flexible curves, and its disadvantages related to sensitivity to outliers.
Logistic regression allows prediction of discrete outcomes from continuous and discrete variables. It addresses the same kinds of questions as discriminant analysis and multiple regression, but without their distributional assumptions. There are two main types: binary logistic regression for dichotomous dependent variables, and multinomial logistic regression for variables with more than two categories. Binary logistic regression expresses the log odds of the dependent variable as a function of the independent variables. Logistic regression assesses the effects of multiple explanatory variables on a binary outcome variable. It is useful when the dependent variable is not normally distributed, there is no homoscedasticity, or normality and linearity are suspect.
This document provides an overview of decision trees, including:
- Decision trees classify records by sorting them down the tree from root to leaf node, where each leaf represents a classification outcome.
- Trees are constructed top-down by selecting the most informative attribute to split on at each node, usually based on information gain.
- Trees can handle both numerical and categorical data and produce classification rules from paths in the tree.
- Examples of decision tree algorithms like ID3 that use information gain to select the best splitting attribute are described. The concepts of entropy and information gain are defined for selecting splits.
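Since these summaries lean on entropy and information gain, a small R sketch may help; the toy labels and split below are invented for illustration.

```r
# Shannon entropy (base 2) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# Toy data: how much does splitting on 'outlook' reduce the entropy of 'play'?
play    <- c("yes", "yes", "no", "no", "yes", "no")
outlook <- c("sunny", "sunny", "rain", "rain", "sunny", "rain")

parent_entropy <- entropy(play)
# Entropy of each child node, weighted by its share of the records
weighted <- sapply(split(play, outlook),
                   function(s) length(s) / length(play) * entropy(s))
info_gain <- parent_entropy - sum(weighted)
info_gain  # ID3 picks the attribute with the highest information gain
```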
This document provides an overview of linear regression analysis. It defines key terms like dependent and independent variables. It describes simple linear regression, which involves predicting a dependent variable based on a single independent variable. It covers techniques for linear regression including least squares estimation to calculate the slope and intercept of the regression line, the coefficient of determination (R2) to evaluate the model fit, and assumptions like independence and homoscedasticity of residuals. Hypothesis testing methods for the slope and correlation coefficient using the t-test and F-test are also summarized.
The document discusses decision tree algorithms. It begins with an introduction and example, then covers the principles of entropy and information gain used to build decision trees. It provides explanations of key concepts like entropy, information gain, and how decision trees are constructed and evaluated. Examples are given to illustrate these concepts. The document concludes with strengths and weaknesses of decision tree algorithms.
The document discusses exploratory data analysis (EDA) techniques in R. It explains that EDA involves analyzing data using visual methods to discover patterns. Common EDA techniques in R include descriptive statistics, histograms, bar plots, scatter plots, and line graphs. Tools like R and Python are useful for EDA due to their data visualization capabilities. The document also provides code examples for creating various graphs in R.
Decision trees are a type of supervised learning algorithm used for classification and regression. ID3 and C4.5 are algorithms that generate decision trees by choosing the attribute with the highest information gain at each step. Random forest is an ensemble method that creates multiple decision trees and aggregates their results, improving accuracy. It introduces randomness when building trees to decrease variance.
This document provides an overview of logistic regression analysis. It introduces the need for logistic regression when the dependent variable is binary. Key concepts covered include the logistic regression model, interpreting the beta coefficients, assessing goodness of fit using various tests and metrics, and an example of fitting a logistic regression line to predict burger purchasing based on a customer's age. Students are instructed to use statistical software to estimate a logistic regression model and interpret the results.
This document provides an overview of maximum likelihood estimation. It explains that maximum likelihood estimation finds the parameters of a probability distribution that make the observed data most probable. It gives the example of using maximum likelihood estimation to find the values of μ and σ that result in a normal distribution that best fits a data set. The goal of maximum likelihood is to find the parameter values that give the distribution with the highest probability of observing the actual data. It also discusses the concept of likelihood and compares it to probability, as well as considerations for removing constants and using the log-likelihood.
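As a numerical illustration of that normal-distribution example, here is a hedged R sketch; the sample is simulated, and the optimizer bounds are a practical assumption to keep sigma positive.

```r
set.seed(1)
x <- rnorm(200, mean = 5, sd = 2)  # simulated sample

# Negative log-likelihood of a normal distribution with parameters (mu, sigma)
negloglik <- function(par) {
  -sum(dnorm(x, mean = par[1], sd = par[2], log = TRUE))
}

# Maximize the likelihood by minimizing the negative log-likelihood
fit <- optim(c(0, 1), negloglik, method = "L-BFGS-B",
             lower = c(-Inf, 1e-6))
fit$par  # numerical MLEs of mu and sigma

# Closed-form MLEs for comparison: the sample mean and the (1/n) standard deviation
c(mean(x), sqrt(mean((x - mean(x))^2)))
```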
Logistic regression is a statistical method used to predict a binary or categorical dependent variable from continuous or categorical independent variables. It generates coefficients to predict the log odds of an outcome being present or absent. The method assumes a linear relationship between the log odds and independent variables. Multinomial logistic regression extends this to dependent variables with more than two categories. An example analyzes high school student program choices using writing scores and socioeconomic status as predictors. The model fits significantly better than an intercept-only model. Increases in writing score decrease the log odds of general versus academic programs.
Introduces and explains the use of multiple linear regression, a multivariate correlational statistical technique. For more info, see the lecture page at http://goo.gl/CeBsv. See also the slides for the MLR II lecture http://www.slideshare.net/jtneill/multiple-linear-regression-ii
The document describes the author's experience with evaluating a new machine learning algorithm. It discusses how:
- The author and a student developed a new algorithm that performed better than state-of-the-art according to standard evaluation practices, winning an award, but the NIH disagreed.
- This surprised the author and led to a realization that evaluation is more complex than initially understood, motivating the author to learn more about evaluation methods from other fields.
- The author has since co-written a book on evaluating machine learning algorithms from a classification perspective, and the tutorial presented is based on this book to provide an overview of issues in evaluation and available resources.
The document provides an overview of linear models and their extensions for data science applications. It begins with an introduction to linear regression and how it finds the coefficients that minimize squared error loss. It then discusses generalizing linear models to binary data using link functions. Regularization methods like ridge regression, lasso, elastic net, and grouped lasso are introduced to reduce overfitting. The document also covers extensions such as generalized additive models, support vector machines, and mixed effects models. Overall, the document aims to convince the reader that simple linear models can be very effective while also introducing more advanced techniques.
Descriptive Statistics and Data Visualization (Douglas Joubert)
This document provides an overview of descriptive statistics and data visualization techniques. It discusses levels of measurement, descriptive versus inferential statistics, and univariate analysis. Various graphical methods for displaying data are also described, including frequency distributions, histograms, Pareto charts, boxplots, and scatterplots. The document aims to help readers choose appropriate analysis and visualization methods based on their research questions and data types.
The document summarizes a study on applying ordinal logistic regression to analyze a proposed new integrated education plan takaful (Islamic insurance) product. The study used a questionnaire distributed to 410 respondents to collect data on demographics and preferences. Ordinal logistic regression and correlation analyses found high acceptance of the integrated plan among all income levels. The proposed plan combines multiple riders into one affordable plan.
According to Wikipedia, point estimation involves the use of sample data to calculate a single value (known as a point estimate, since it identifies a point in some parameter space) which serves as a "best guess" or "best estimate" of an unknown population parameter (for example, the population mean).
The document discusses key concepts in statistical inference including estimation, confidence intervals, hypothesis testing, and types of errors. It provides examples and formulas for estimating population means from sample data, calculating confidence intervals, stating the null and alternative hypotheses, and making decisions to accept or reject the null hypothesis based on a significance level.
This document provides an overview of simple linear regression analysis. It discusses estimating regression coefficients using the least squares method, interpreting the regression equation, assessing model fit using measures like the standard error of the estimate and coefficient of determination, testing hypotheses about regression coefficients, and using the regression model to make predictions.
Unit-III Correlation and Regression.pptx (Anusuya123)
Unit-III describes different types of relationships between variables through correlation and regression analysis. It discusses:
1) Correlation measures the strength and direction of a linear relationship between two variables on a scatter plot. Positive correlation means variables increase together, while negative correlation means one increases as the other decreases.
2) Regression analysis uses independent variables to predict outcomes of a dependent variable. A regression line minimizes the squared errors between predicted and actual values.
3) The correlation coefficient r and coefficient of determination r-squared quantify the strength and direction of linear relationships, with values between -1 and 1. Extreme scores on one measurement tend to regress toward the mean on subsequent measurements.
This document provides an overview of regression analysis. It defines regression analysis as a predictive modeling technique used to investigate relationships between dependent and independent variables. It describes simple linear regression as involving one independent variable and one dependent variable, with the goal of finding the best fitting straight line through the data points. An example is provided to demonstrate how to conduct a simple linear regression to predict population in the year 2005 based on population data from previous years.
Simple Linear Regression Corr.docx (budbarber38650)
FSE 200, Adkins
Simple Linear Regression
Correlation only measures the strength and direction of the linear relationship between two quantitative variables. If the relationship is linear, then we would like to try to model that relationship with the equation of a line. We will use a regression line to describe the relationship between an explanatory variable and a response variable.
A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x.
Ex. It has been suggested that there is a relationship between sleep deprivation of employees and the ability to complete simple tasks. To evaluate this hypothesis, 12 people were asked to solve simple tasks after having been without sleep for 15, 18, 21, and 24 hours. The sample data are shown below.
Subject | Hours without sleep, x | Tasks completed, y
1 | 15 | 13
2 | 15 | 9
3 | 15 | 15
4 | 18 | 8
5 | 18 | 12
6 | 18 | 10
7 | 21 | 5
8 | 21 | 8
9 | 21 | 7
10 | 24 | 3
11 | 24 | 5
12 | 24 | 4
Draw a scatterplot and describe the relationship. Lay a straight-edge on top of the plot and move it around until you find what you think might be a “line of best fit.” Then try to predict the number of tasks completed for someone having been without sleep for 16 hours.
Was your line the same as that of the classmate sitting next to you? Probably not. We need a method that we can use to find the “best” regression line to use for prediction. The method we will use is called least-squares. No line will pass exactly through all the points in the scatterplot. When we use the line to predict a y for a given x value, if there is a data point with that same x value, we can compute the error (residual): residual = observed y - predicted y = y - ŷ.
Our goal is going to be to make the vertical distances from the line as small as possible. The most commonly used method for doing this is the least-squares method.
The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
Equation of the Least-Squares Regression Line
· Least-Squares Regression Line: ŷ = b0 + b1x
· Slope of the Regression Line: b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²
· Intercept of the Regression Line: b0 = ȳ - b1x̄
Generally, regression is performed using statistical software. Clearly, given the appropriate information, the above formulas are simple to use.
Once we have the regression line, how do we interpret it, and what can we do with it?
The slope of a regression line is the rate of change: the amount of change in ŷ when x increases by 1.
The intercept of the regression line is the value of ŷ when x = 0. It is statistically meaningful only when x can take on values that are close to zero.
To make a prediction, just substitute an x-value into the equation and find ŷ.
To plot the line on a scatterplot, just find a couple of points on the regression line, one near each end of the range of x in the data. Plot the points and connect them with a line.
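To check a prediction like the one asked for in the exercise above, the same least-squares line can be fitted in R; this is a sketch using the sample data from the table, and the numeric comments are approximate.

```r
hours <- c(15, 15, 15, 18, 18, 18, 21, 21, 21, 24, 24, 24)
tasks <- c(13, 9, 15, 8, 12, 10, 5, 8, 7, 3, 5, 4)

fit <- lm(tasks ~ hours)   # least-squares regression line
coef(fit)                  # intercept about 26.7, slope about -0.94

# Predicted number of tasks after 16 hours without sleep (about 11.6)
predict(fit, newdata = data.frame(hours = 16))

plot(hours, tasks)         # scatterplot with the fitted line overlaid
abline(fit)
```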
This document provides an introduction to correlation and regression. It defines correlation as a measure of the association between two numerical variables, and describes positive and negative correlation. Regression analysis is introduced as a method to describe and predict the relationship between two variables. The key aspects of simple linear regression are discussed, including determining the line of best fit and evaluating the model performance using the coefficient of determination (R2).
This document discusses multiple linear regression. It begins by explaining linear regression and its applications. It then discusses multiple linear regression, where there is more than one independent variable. As an example, it describes using multiple linear regression to estimate company profits based on various independent variables. The document provides resources for learning more about linear regression in Python.
The document provides information about statistics and economics tutorials being offered after school, including regression analysis, correlation, and the normal distribution. It gives examples of calculating rank correlation, finding regression equations, and using the standard normal distribution table. It also explains key aspects of the normal distribution like the 68-95-99.7 rule and how to calculate probabilities using the normal distribution function in Excel.
This document discusses quantitative and qualitative data analysis techniques. It covers:
- Displays for numerical (frequency charts, histograms) and categorical data (bar charts, pie charts, contingency tables).
- Measures for numerical data including mean, median, mode, range, variance, standard deviation, and quartiles.
- Scatter plots to examine relationships between two quantitative variables and measures of association like covariance and correlation coefficient.
- Contingency tables to study relationships between two categorical variables and examine dependency/independency.
- An example analyzing Titanic passenger data using contingency tables to examine the "first-class passengers first" policy.
The document provides an overview of regression analysis including:
- Regression analysis is a statistical process used to estimate relationships between variables and predict unknown values.
- The document outlines different types of regression like simple, multiple, linear, and nonlinear regression.
- Key aspects of regression like scatter diagrams, regression lines, and the method of least squares are explained.
- An example problem is worked through demonstrating how to calculate the slope and y-intercept of a regression line using the least squares method.
This document provides an overview of regression analysis and two-way tables. It defines key concepts such as regression lines, correlation, residuals, and marginal and conditional distributions. Regression finds the linear relationship between two variables to make predictions. The least squares regression line minimizes the vertical distance between the data points and the line. Correlation and the coefficient of determination r2 measure how well the regression line fits the data. Two-way tables summarize the relationship between two categorical variables through marginal and conditional distributions.
This document provides an overview of correlation and linear regression. It defines key terms like independent variable, dependent variable, correlation coefficient, and regression coefficients. It explains how to calculate the correlation coefficient and regression coefficients using the least squares method. Properties of the regression coefficients are also discussed. Examples are provided to demonstrate how to interpret correlation, draw scatter plots, calculate coefficients, and predict values using the linear regression equation.
Research method ch09 statistical methods 3 estimation np (naranbatn)
This document provides an overview of estimation and correlation analysis techniques used in research methods. It defines key terms like correlation, linear regression, and discusses topics like dealing with collinearity between variables. Non-parametric tests that don't assume a particular distribution are also introduced, such as the Wilcoxon test, chi-square test, and Kruskal-Wallis test. Multivariate techniques like discriminant analysis, multivariate ANOVA, and factor analysis are briefly outlined.
Correlation and regression analysis are statistical methods used to determine if a relationship exists between variables and describe the nature of that relationship. A scatter plot graphs the independent and dependent variables and allows visualization of any trends in the data. The correlation coefficient measures the strength and direction of the linear relationship between variables, ranging from -1 to 1. Regression finds the linear "best fit" line that minimizes the residuals and can be used to predict dependent variable values.
Correlation and regression analysis are statistical methods used to determine if a relationship exists between variables and describe the nature of that relationship. A scatter plot graphs the independent and dependent variables and allows visualization of any trends in the data. The correlation coefficient measures the strength and direction of the linear relationship between variables, ranging from -1 to 1. Regression finds the linear "best fit" line that minimizes the residuals, or differences between observed and predicted dependent variable values. The coefficient of determination measures how much variation in the dependent variable is explained by the regression model.
Data Science - Part XII - Ridge Regression, LASSO, and Elastic Nets (Derek Kane)
The document discusses various regression techniques including ridge regression, lasso regression, and elastic net regression. It begins with an overview of advancements in regression analysis since the late 1800s/early 1900s enabled by increased computing power. Modern high-dimensional data often has many independent variables, requiring improved regression methods. The document then provides technical explanations and formulas for ordinary least squares regression, ridge regression, lasso regression, and their properties such as bias-variance tradeoffs. It explains how ridge and lasso regression address limitations of OLS through regularization that shrinks coefficients.
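For readers who want to try these penalties, they are available in R through the glmnet package; the package dependency and the simulated x and y below are assumptions of this sketch, not part of the document.

```r
library(glmnet)  # install.packages("glmnet") if needed

set.seed(42)
x <- matrix(rnorm(100 * 20), nrow = 100)  # stand-in predictor matrix
y <- rnorm(100)                           # stand-in response

ridge <- glmnet(x, y, alpha = 0)    # alpha = 0: ridge (L2) penalty
lasso <- glmnet(x, y, alpha = 1)    # alpha = 1: lasso (L1) penalty
enet  <- glmnet(x, y, alpha = 0.5)  # 0 < alpha < 1: elastic net

cv <- cv.glmnet(x, y, alpha = 1)    # cross-validate the shrinkage parameter lambda
coef(cv, s = "lambda.min")          # coefficients at the best lambda
```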
Isotonic Regression is a statistical technique of fitting a free-form line to a sequence of observations such that the fitted line is non-decreasing (or non-increasing) everywhere, and lies as close to the observations as possible. Isotonic Regression is limited to predicting numeric output so the dependent variable must be numeric in nature…
how to select the appropriate method for our study of interest (NurFathihaTahiatSeeu)
The document discusses using linear regression to analyze the relationship between the duration of studying before an exam (independent variable) and exam performance (dependent variable). It provides an overview of linear regression, noting it describes how a dependent variable relates to one or more independent variables. It also discusses key assumptions that must be met for linear regression, such as having continuous and linearly related variables, independence of observations, and normally distributed residuals. Checking tools like histograms and normal P-P plots are recommended to validate meeting the assumptions.
This presentation covered the following topics:
1. Definition of Correlation and Regression
2. Meaning of Correlation and Regression
3. Types of Correlation and Regression
4. Karl Pearson's methods of correlation
5. Bivariate Grouped data method
6. Spearman's Rank correlation Method
7. Scattered diagram method
8. Interpretation of correlation coefficient
9. Lines of Regression
10. Regression Equations
11. Difference between correlation and regression
12. Related examples
Data Science - Part IV - Regression Analysis & ANOVA (Derek Kane)
This lecture provides an overview of linear regression analysis, interaction terms, ANOVA, optimization, log-level, and log-log transformations. The first practical example centers around the Boston housing market where the second example dives into business applications of regression analysis in a supermarket retailer.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Learn SQL from basic queries to advanced queries (manishkhaire30)
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You... (Aggregage)
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data (Kiwi Creative)
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
The Ipsos - AI - Monitor 2024 Report.pdf (Social Samosa)
According to Ipsos AI Monitor's 2024 report, 65% Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
End-to-end pipeline agility - Berlin Buzzwords 2024 (Lars Albertsson)
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
2. Agenda for Today’s Session
WHAT IS REGRESSION: We will understand regression and move on to its types.
REGRESSION DIAGNOSTICS: Let's find what works best with linear regression.
LINEAR REGRESSION vs LOGISTIC: Let's find out the difference between linear and logistic regression.
UNDERSTANDING LINEAR REGRESSION ALGORITHM ASSUMPTIONS: Let's look into the assumptions and violations.
UNDERSTANDING LOGISTIC REGRESSION: Let's look into the algorithm and understand how it functions.
INTERVIEW QUESTIONS: Covering some interview questions to strengthen the knowledge and give an idea about interviews too.
5. Understanding Linear Regression Algorithm
Intro
Establishes a relationship between the independent and dependent variables.
Examples of independent and dependent variables:
• x is rainfall and y is crop yield
• x is advertising expense and y is sales
• x is sales of goods and y is GDP
Here x is the independent variable and y is the dependent variable.
How it Works
• Regression analysis is used to understand which of the independent variables are related to the dependent variable.
• It attempts to model the relationship between two variables by fitting a line called the linear regression line.
• The case of a single independent variable is called simple linear regression, whereas the case of multiple independent variables is called multiple linear regression.
6. Single Linear Regression Vs Multiple Linear Regression
The linear regression line is created using the ordinary least squares method.
[Diagram: simple linear regression relates a single predictor X to Y; multiple linear regression relates multiple predictors X1, X2, X3, X4 to Y.]
8. Sum of Squared Error
What is Error?
• Actual value - predicted value is called the error.
• Here the predicted value is the value predicted by the linear regression model.
• Also known as the residual.
Why is it Important?
The smaller the residuals, the more accurate the model.
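A short R sketch of the residual and SSE computation this slide describes, using made-up points:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)  # made-up data

fit <- lm(y ~ x)
res <- y - fitted(fit)  # error = actual value - predicted value
sse <- sum(res^2)       # sum of squared error
sse                     # smaller residuals mean a more accurate model
```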
9. Regression Line | Best Fit Line
What is the Line of Best Fit?
• The best fit line is the line that gives the minimum SSE.
• Among all possible lines, there will be one line that is the best fit, meaning it gives the greatest possible accuracy as a model.
• The line that minimizes the sum of squared residuals is called the regression line or the best fit line.
• In simple terms, it is the straight line that best represents the data on a scatterplot. It is drawn using the least squares method.
Use the Least Square Method to Determine the Eqn. of Line of Best Fit:
x: 8, 2, 11, 6, 5, 4, 12, 9, 6, 1
y: 3, 10, 3, 6, 8, 12, 1, 4, 9, 14
10. Finding Best Fit Line Algorithm
How to find the Best Fit Line?
• The equation of the straight line is given by y = mx + c, where m is the slope of the line and c is the intercept (the point at which the straight line crosses the y-axis).
• The best fit line is found using the least squares method.
Algorithm:
Step 1: Find the mean of the x-values and the y-values: x̄ = Σx/n and ȳ = Σy/n.
Step 2: Calculate the slope of the line: m = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)².
Step 3: Compute the y-intercept of the line: c = ȳ - m·x̄.
11. Regression Line | Best Fit Line
Use the Least Square Method to Determine the Eqn. of Line of Best Fit
Steps: the following steps are deployed to achieve the objective.
Step 1: Calculate the mean of the x and y values.
Step 2: Find Σ(x - x̄)(y - ȳ) and Σ(x - x̄)².
12. Regression Line | Best Fit Line
Step 1: Calculate the mean of the x and y values. The mean value of x is 6.4 and the mean of y is 7.
Step 2: Find m (slope): m = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² = -131 / 118.4 ≈ -1.1
Step 3: Calculate the y-intercept: b (intercept) = 7 - (-1.1) × 6.4 ≈ 14.0
Thus, the equation of the line is y = -1.1x + 14.
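The hand calculation above can be verified in R; lm() performs the same ordinary least squares fit.

```r
x <- c(8, 2, 11, 6, 5, 4, 12, 9, 6, 1)
y <- c(3, 10, 3, 6, 8, 12, 1, 4, 9, 14)

m <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # -131 / 118.4, about -1.11
b <- mean(y) - m * mean(x)                                      # about 14.08

c(slope = m, intercept = b)
coef(lm(y ~ x))  # ordinary least squares returns the same line
```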
14. R Squared
1. R squared is a statistical measure that represents the proportion of the variance of a dependent variable that is explained by the independent variable(s).
2. While correlation defines the strength of a relationship, R squared explains to what extent the variance of one variable explains the variance of another variable.
3. Example: in investing, R squared is the percentage of a fund's movement that can be explained by the movement of a benchmark (e.g. the Sensex).
4. Also known as the coefficient of determination.
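Continuing the same worked example, R squared can be read directly off the fitted model in R:

```r
x <- c(8, 2, 11, 6, 5, 4, 12, 9, 6, 1)
y <- c(3, 10, 3, 6, 8, 12, 1, 4, 9, 14)

fit <- lm(y ~ x)
summary(fit)$r.squared  # proportion of variance in y explained by x
cor(x, y)^2             # in simple regression this equals R squared
```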
16. How to See if the Assumptions are Violated – Deciding if a Linear Model is a Good Fit
Residual vs Fitted Values Plot
1. The x-axis has the fitted values and the y-axis has the residuals.
2. Residual = observed y value - predicted y value.
3. The vertical distance of an actual point from the line of best fit is called the residual.
4. If you are unsure about the shape (curve) of the regression equation from the scatterplot, a residual plot helps in making the decision.
When a pattern is observed in a residual plot, a linear regression model is probably not appropriate for your data. The data should be randomly scattered around the zero line.
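In R, the residual-vs-fitted plot described here takes a couple of lines; the built-in cars dataset stands in for real data.

```r
fit <- lm(dist ~ speed, data = cars)  # example model on built-in data

plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # points should scatter randomly around the zero line

plot(fit, which = 1)    # equivalent built-in diagnostic plot
```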
17. Normal Q-Q Plot (Quantile-Quantile Plot)
1. If the data is normally distributed, the points in the Q-Q normal plot lie on a straight diagonal line.
2. The greater the departure from this reference line, the greater the evidence that the data does not follow a normal distribution.
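The corresponding normal Q-Q check in R, on the same example model:

```r
fit <- lm(dist ~ speed, data = cars)

qqnorm(resid(fit))  # points should lie on a straight diagonal line
qqline(resid(fit))  # reference line; large departures suggest non-normality
```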
18. Interview Questions
01. Which of the following is true about residuals?
A) Lower is better
B) Higher is better
C) A or B depends on the situation
D) None of these
19. Interview Questions - Solution
01. Which of the following is true about residuals?
A) Lower is better
B) Higher is better
C) A or B depends on the situation
D) None of these
Solution: A. Residuals refer to the error value of the model; hence, lower is better. (Residual = y - ŷ)
20. Interview Questions
02. Which of the following statements is true regarding residuals in regression?
A) The mean of the residuals is always zero.
B) The mean of the residuals is always less than zero.
C) The mean of the residuals is always more than zero.
D) There is no such rule for residuals.
21. Interview Questions - Solution
02. Which of the following statements is true regarding residuals in regression?
A) The mean of the residuals is always zero.
B) The mean of the residuals is always less than zero.
C) The mean of the residuals is always more than zero.
D) There is no such rule for residuals.
Solution: A
The sum of residuals in regression is always zero. If the sum of residuals is zero, the mean will also be zero.
22. Interview Questions
03. To test the linear relationship between a continuous y (dependent) and x (independent) variable, which of the following plots is best suited?
A) Scatterplot
B) Barplot
C) Histograms
D) None of These.
23. Interview Questions - Solution
03. To test the linear relationship between a continuous y (dependent) and x (independent) variable, which of the following plots is best suited?
A) Scatterplot
B) Barplot
C) Histograms
D) None of These.
Solution: A
To test the linear relationship between continuous variables, a scatter plot is a good option. It shows how one variable changes with respect to another; a scatter plot displays the relationship between two quantitative variables.
24. Interview Questions
04. The correlation between the age and health of a person was found to be -1.09. On the basis of this you would tell the doctors that:
A) The age is good predictor of health
B) The age is poor predictor of health
C) None of These.
25. Interview Questions - Solution
04. The correlation between the age and health of a person was found to be -1.09. On the basis of this you would tell the doctors that:
A) The age is good predictor of health
B) The age is poor predictor of health
C) None of These.
Solution: C
The correlation coefficient ranges between -1 and 1, so a value of -1.09 is not possible.
26. Interview Questions
05. Which of the following offsets do we use in the case of a least squares line fit? Suppose the horizontal axis is the independent variable and the vertical axis is the dependent variable.
A) Vertical Offset
B) Perpendicular Offset
C) Both
D) None of the Above.
27. Interview Questions - Solution
05. Which of the following offsets do we use in the case of a least squares line fit? Suppose the horizontal axis is the independent variable and the vertical axis is the dependent variable.
A) Vertical Offset
B) Perpendicular Offset
C) Both
D) None of the Above.
Solution: A
We always consider residuals as vertical offsets. Perpendicular offsets are useful in the case of PCA.
28. Linear Regression – Model Assumptions
Since Linear Regression assesses whether one or more predictor variables explain the dependent variable, it has five assumptions:
1. Linear Relationship
2. Normality
3. No or Little Multicollinearity
4. No Autocorrelation in Errors
5. Homoscedasticity
Note on sample size – a common rule of thumb is that regression analysis requires at least 20 data points per independent variable in the analysis.
29. 1. Check for Linearity
1. Linear Regression needs the relationship between the independent and dependent variables to be linear and additive.
2. Being additive means the effect of x on y is independent of other variables.
3. Linearity can be checked using scatter plots: if the points show little to no linear pattern, the assumption is violated.
30. Transforming Variables to Achieve Linearity
• Each row of the reference table shows a different transformation method.
• The Transform column shows the transformation to be applied to the DV or IV.
• The Regression equation column shows the equation used in the analysis.
• The last column shows the equation used for prediction.
31. Non-Linear to Linear Conversion
The best transformation depends on the data; the best model will give the highest coefficient of determination.
Steps involved (a sketch of this workflow follows the list):
1. Create a linear regression model.
2. Construct a residual plot.
3. If the plot is random, don't transform the data.
4. Compute the coefficient of determination (R2).
5. Choose a transformation method as given in the table on the previous slide.
6. Transform the IV or DV or both.
7. Apply regression again.
8. If the transformed R2 is greater than the previous score, the transformation is a success.
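A sketch of this workflow in Python, on made-up data that grows roughly exponentially (the numbers here are illustrative, not from the slides):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.0, 4.1, 8.3, 15.9, 33.0, 65.2, 130.0, 255.0])  # roughly exponential

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Steps 1 and 4: untransformed linear fit and its R2
m, c = np.polyfit(x, y, 1)
r2_raw = r_squared(y, m * x + c)

# Steps 5-7: log-transform the DV (exponential model), refit, predict on the original scale
m_t, c_t = np.polyfit(x, np.log(y), 1)
r2_log = r_squared(y, np.exp(m_t * x + c_t))

# Step 8: if the transformed R2 is higher, the transformation is a success
print(f"raw R2 = {r2_raw:.3f}, log-transformed R2 = {r2_log:.3f}")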
32. Transformation Example
Objectives:
1. Create a linear regression model in Excel or R.
2. Find the linear regression equation.
3. Make predictions.
4. Create a residual plot and find whether linear regression is an appropriate fit or not.
5. If not, try a transformation.
[Figure: scatterplot of Y vs X]
33. 2. Check for Normality
1. Linear Regression requires the variables to be normally distributed.
2. Normality can be checked using histograms or Q-Q plots.
3. Formal tests for normality (goodness-of-fit tests) include the Kolmogorov-Smirnov test and the Shapiro-Wilk normality test.
4. If the data is not normal, a non-linear transformation (e.g., a log transformation) can fix the issue.
5. Normality here means that the Y values are normally distributed for each X.
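A minimal sketch of the Shapiro-Wilk test with scipy, on a made-up sample:

import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
data = rng.normal(loc=0, scale=1, size=100)   # made-up sample

stat, p_value = shapiro(data)
# Null hypothesis: the data are normally distributed.
# p < 0.05 -> reject normality; p >= 0.05 -> no evidence against normality.
print(f"W = {stat:.3f}, p = {p_value:.3f}")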
34. 3. Check for Multicollinearity
1. Multicollinearity means that the predictors are correlated with each other. The presence of correlation among the independent variables leads to multicollinearity.
2. What happens if variables are correlated? It becomes difficult for the model to determine the true effect of the independent variables on the dependent variable.
3. Multicollinearity is measured by the VIF (Variance Inflation Factor):
1. VIF tells us how much the variance of an estimated coefficient increases if the predictors are correlated. If no factors are correlated, the VIF will be 1.
2. VIF = 1 – no multicollinearity.
3. VIF > 1 – the predictors may be correlated.
4. VIF between 5 and 10 – indicates high correlation.
5. VIF > 10 – regression coefficients are poorly estimated due to multicollinearity.
Solutions:
1. Drop the variable showing high collinearity. The presence of collinearity suggests that the information this variable provides about the DV is redundant.
2. Another approach is to combine the collinear variables into a new predictor (e.g., by taking their average).
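A short sketch of computing VIFs with statsmodels; the predictor matrix and column names here are hypothetical:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictor matrix; the column names are illustrative
X = pd.DataFrame({
    "passes":   [10, 12, 15, 11, 14, 9, 13],
    "rebounds": [5, 6, 8, 5, 7, 4, 6],
    "blocks":   [1, 2, 3, 1, 2, 1, 2],
})
X = sm.add_constant(X)  # VIF is usually computed on a design matrix with an intercept

vifs = {col: variance_inflation_factor(X.values, i) for i, col in enumerate(X.columns)}
print(vifs)  # VIF near 1 -> little collinearity; 5-10 -> high; >10 -> problematic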
37. 4. Heteroscedasticity
Heteroscedasticity means that the data has unequal dispersion – in other words, unequal scatter of the residuals.
Why is it a problem?
It is a problem because OLS regression assumes that all residuals are drawn from a population with constant variance (homoscedasticity). Heteroscedasticity leaves the coefficient estimates unbiased but makes them inefficient and biases their standard errors, invalidating the usual tests and confidence intervals.
Heteroscedasticity, also spelled heteroskedasticity, occurs more often in datasets that have a large range between the largest and smallest observed values.
A classic example of heteroscedasticity: if you model household consumption based on income, you'll find that the variability in consumption increases as income increases.
Breusch-Pagan / Cook-Weisberg test – this test is used to determine the presence of heteroskedasticity. If you find p < 0.05, you reject the null hypothesis and infer that heteroskedasticity is present.
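A sketch of the Breusch-Pagan test with statsmodels, on simulated data where the noise deliberately grows with x:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 200)
y = 2 * x + rng.normal(0, x)            # error spread grows with x -> heteroscedastic

model = sm.OLS(y, sm.add_constant(x)).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print(f"p = {lm_pvalue:.4f}")           # p < 0.05 -> reject the null, heteroskedasticity present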
38. Scale-Location Plot
• This plot is also useful for detecting heteroskedasticity. Ideally, it shouldn't show any pattern.
• The presence of a pattern indicates heteroskedasticity.
• Don't forget to corroborate the findings of this plot with the funnel shape in the residuals vs. fitted values plot.
39. Leverage Plot
Leverage: an observation with an extreme value on a predictor variable is called a point with high leverage. Leverage is a measure of how far an observation deviates from the mean of that variable. High-leverage points can have a large effect on the estimates of the regression coefficients.
40. Summary of Tests in Python for Linear Regression Assumptions
Multicollinearity test – Variance Inflation Factor
from statsmodels.stats.outliers_influence import variance_inflation_factor
Normality tests – Shapiro-Wilk test, Jarque-Bera test
from scipy.stats import shapiro
Autocorrelation tests – Durbin-Watson test, Breusch-Godfrey test
from statsmodels.stats.stattools import durbin_watson
Heteroscedasticity tests – Goldfeld-Quandt test, Breusch-Pagan test
import statsmodels.stats.api as sts
Non-linearity test – Linear Rainbow test
import statsmodels.stats.api as sts
41. Autocorrelation of Residuals
Autocorrelation of errors means that the errors are correlated with each other.
The assumption is that the linear regression model's residuals are not correlated.
Test of the assumption – Durbin-Watson test.
Package – statsmodels.stats.stattools.durbin_watson(resids, axis=0)
What is the null hypothesis?
The null hypothesis of the test is that there is no serial correlation.
Statistic (always between 0 and 4):
• The test statistic is approximately 2*(1-r), where r is the sample autocorrelation of the residuals.
• Thus r = 0, indicating no serial correlation, gives a test statistic of 2.
• Values closer to 0 give more evidence for positive serial correlation; values closer to 4 indicate negative serial correlation.
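A minimal usage sketch of the Durbin-Watson statistic on the residuals of a fitted OLS model (simulated data with independent errors):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
x = np.arange(100, dtype=float)
y = 3 * x + rng.normal(0, 5, 100)       # independent errors

model = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(model.resid))       # ~2 -> no serial correlation; near 0 -> positive; near 4 -> negative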
42. Assessing Goodness of Fit – R2
After fitting the model, it is essential to understand how well the model fits the data.
When does the model fit the data well?
A model fits the data well if the differences between the actual values and the model's predicted values are small and unbiased.
What is R-Squared (R2)?
It is a statistical measure of how close the data is to the fitted regression line. The definition of R-squared is fairly straightforward: it is the percentage of the response variable variation that is explained by a linear model. Or:
R-squared = Explained variation / Total variation
R-squared is always between 0 and 100%:
• 0% indicates that the model explains none of the variability of the response data around its mean.
• 100% indicates that the model explains all the variability of the response data around its mean.
In general, the higher the R-squared, the better the model fits your data.
44. Interview Questions - Solution
01. True or False: Linear Regression is a supervised machine learning algorithm.
A) TRUE
B) FALSE
Solution: A. Linear regression is a supervised learning algorithm because it uses true labels for training. A supervised learning algorithm has an input variable (x) and an output variable (y) for each example.
45. Interview Questions
02. Which of the following methods do we use to find the best fit line for data in Linear
Regression?
A) Least Square Method
B) Maximum Likelihood
C) Both A and B
46. Interview Questions - Solution
02. Which of the following methods do we use to find the best fit line for data in Linear
Regression?
A) Least Square Method
B) Maximum Likelihood
C) Both A and B
Solution: A. In Linear Regression, we use the Least Squares Method to identify the best fit line.
47. Interview Questions
03. Which of the following evaluation metrics can be used to evaluate a model while
modelling a continuous output variable?
A) AUC - ROC
B) Accuracy
C) Mean Squared Error
48. Interview Questions - Solution
03. Which of the following evaluation metrics can be used to evaluate a model while
modelling a continuous output variable?
A) AUC - ROC
B) Accuracy
C) Mean Squared Error
Solution: C. Linear Regression produces continuous output values, so we use the Mean Squared Error metric to evaluate model performance. The other options are used for classification problems.
49. Interview Questions
05. Which of the following statements is true about outliers in Linear Regression
A) Linear Regression is sensitive to outliers
B) Linear Regression is not sensitive to outliers
C) No Idea.
D) None of these
50. Interview Questions - Solution
05. Which of the following statements is true about outliers in Linear Regression
A) Linear Regression is sensitive to outliers
B) Linear Regression is not sensitive to outliers
C) No Idea.
D) None of these
Solution: A. The slope of the regression line will change if outliers are present in the data; therefore, Linear Regression is sensitive to outliers.
51. Linear Regression Example
Objectives:
1. Create Linear Regression Model in Excel or R.
2. Find the Linear Regression Equation
3. Make Predictions
4. Create a residual plot and find whether linear regression is an appropriate fit or not.
Data:
x: 1, 2, 3, 4, 5
y: 2, 1, 3.5, 3, 4.5
Fitted line: y = 0.7x + 0.7
[Figure: scatterplot of x and y with the fitted line]
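A sketch of the same fit with statsmodels OLS, which recovers the line on the chart (y = 0.7x + 0.7):

import numpy as np
import statsmodels.api as sm

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 1, 3.5, 3, 4.5])

model = sm.OLS(y, sm.add_constant(x)).fit()
print(model.params)        # intercept ~0.7, slope ~0.7 -> y = 0.7x + 0.7
print(model.rsquared)      # ~0.67, share of variance explained

# model.fittedvalues and model.resid give the inputs for the residual plot (objective 4)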
52. Assessing Model Fit
Residuals
The distance of an actual point from the line of prediction.
Root Mean Squared Error (RMSE)
Reported as the "Residual Standard Error" in linear model output. It is interpreted as how far, on average, the residuals are from zero.
Mean Absolute Error (MAE)
Mean Absolute Error is another metric to evaluate the model: the average of the absolute differences between actual and predicted values. For example, if a single actual y is 10 and the predicted y is 30, the absolute error is |30 - 10| = 20. MAE is more robust against the effect of outliers than RMSE.
R Square
This metric gives the percentage of variance explained by the model. It ranges between 0 and 1, and a higher value is generally better.
Adjusted R Square
Since R Square increases as new variables are introduced, regardless of whether a new variable actually adds information to the model, we look to Adjusted R Square, which only increases or decreases if the newly added variable is truly useful.
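A short sketch computing these metrics with scikit-learn on the slide-51 example (RMSE taken as the square root of the mean squared error):

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([2, 1, 3.5, 3, 4.5])
y_pred = 0.7 * np.array([1, 2, 3, 4, 5]) + 0.7      # predictions from the fitted line

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # typical distance of residuals from zero
mae = mean_absolute_error(y_true, y_pred)           # average absolute residual
r2 = r2_score(y_true, y_pred)                       # share of variance explained
print(f"RMSE={rmse:.2f}, MAE={mae:.2f}, R2={r2:.2f}")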
53. Difference Between Linear and Logistic Regression
• Linear Regression models the data using a straight line; Logistic Regression is a statistical model that predicts the probability of an outcome that can take two values.
• In Linear Regression the outcome (dependent variable) is continuous; in Logistic Regression the outcome has only a limited number of possible values.
• The output variable of Linear Regression is continuous; the output variable of Logistic Regression is discrete.
• Linear Regression is used to solve regression problems; Logistic Regression is used to solve classification problems (binary classification).
• Linear Regression estimates the dependent variable when there is a change in the independent variable; Logistic Regression calculates the probability of occurrence of an event.
• Linear Regression uses the ordinary least squares method to minimise the errors and arrive at the best possible fit; Logistic Regression uses the maximum likelihood method to arrive at the solution.
• Linear Regression uses a straight line; Logistic Regression uses an S-curve (sigmoid function).
• Examples: Linear Regression – predicting sales, house prices, GDP, etc.; Logistic Regression – predicting whether an email is spam, whether a credit card transaction is fraudulent, or whether a customer will buy a product.
55. Logistic Regression
• Logistic Regression is used when the dependent variable is categorical.
• The predicted values lie strictly in the range of 0 and 1.
• It is used to describe data and to explain the relationship between one binary dependent variable and one or more nominal, ordinal, interval or ratio-level independent variables.
56. Logistic Regression Equation
Fundamental equation of the generalized linear model:
g(E(y)) = α + βx1 + γx2
• Here g() is the link function. The role of the link function is to 'link' the expectation of y to the linear predictor.
• E(y) is the expectation of the target variable.
• α + βx1 + γx2 is the linear predictor (α, β, γ are to be estimated).
Let's take a simple linear regression equation with the dependent variable enclosed in a link function:
g(y) = βo + β(Age)
Here the g() function is trying to establish the probability of success (p) or probability of failure (1-p).
Criteria for p:
• It must always be positive (p >= 0).
• It must always be less than or equal to 1 (p <= 1).
Since the probability must always be positive, we put the linear predictor in exponential form; for any value of the slope and the dependent variable, the exponential is never negative:
p = exp(βo + β(Age))
57. Logistic Regression Equation
p = exp(βo + β(Age))
In order to make the probability less than 1, we divide exp(βo + β(Age)) by a number greater than itself:
p = exp(βo + β(Age)) / (exp(βo + β(Age)) + 1)
Writing y = βo + β(Age), this gives
p = e^y / (1 + e^y)
This is the logistic (sigmoid) function.
The probability of success is given by p = e^y / (1 + e^y) and the probability of failure is given by
q = 1 - p = 1 / (1 + e^y)
On dividing the two equations, we get p/q = e^y, or equivalently:
log(p / (1 - p)) = y = βo + β(Age)
58. Logistic Regression Equation
Final equation:
log(p / (1 - p)) = βo + β(Age)
log(p / (1 - p)) is the link function (the logit) and p / (1 - p) is the odds. If the log of the odds is positive, the probability of success is more than 50%.
59. Sigmoid Function
• The sigmoid function, also called the logistic function, gives an S-shaped curve that can take any real value and map it to a value between 0 and 1.
• The range of the output values is strictly between 0 and 1.
• If the output of the sigmoid function is more than 0.5, we classify the outcome as 1 or Yes.
• If the output of the sigmoid function is less than 0.5, we classify the outcome as 0 or No.
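A tiny sketch of the sigmoid and the 0.5 threshold; the coefficients b0 and b1 here are hypothetical, standing in for a fitted g(y) = βo + β(Age):

import numpy as np

def sigmoid(z):
    # Logistic function: maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = -6.0, 0.15                 # hypothetical fitted coefficients
age = np.array([25.0, 45.0, 65.0])

p = sigmoid(b0 + b1 * age)          # probability of success for each age
labels = (p > 0.5).astype(int)      # classify as 1/Yes when p > 0.5
print(p.round(3), labels)           # -> [0.095 0.679 0.977] [0 1 1]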
60. ROC Curve
• The ROC curve is a probability curve, and AUC represents the degree or measure of separability.
• The higher the AUC, the better the model.
• TPR is the True Positive Rate, also known as Recall or Sensitivity, and is given by TP / (TP + FN).
• Specificity = TN / (TN + FP), and the FPR (False Positive Rate) is 1 – Specificity.
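A minimal scikit-learn sketch of computing the ROC curve and AUC; the labels and probabilities here are made up:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])                    # made-up true labels
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3])   # made-up predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # FPR = 1 - specificity, TPR = recall
print(roc_auc_score(y_true, y_prob))              # higher AUC -> better separability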
61. What is a Confusion Matrix
A confusion matrix tabulates actual values (rows) against predicted values (columns).
Let's say we are predicting the presence of a disease: "yes" means they have the disease and "no" means they don't.
1. The classifier made a total of 165 predictions.
2. Out of those 165 cases, the classifier predicted "yes" 110 times and "no" 55 times.
3. In reality, 105 patients in the sample have the disease and 60 patients do not.
The four basic terms of a confusion matrix:
• True Positives
• True Negatives
• False Positives
• False Negatives
62. Performance of a Logistic Regression Model
1. AIC (Akaike Information Criterion) – the metric analogous to adjusted R² in logistic regression is AIC. AIC is a measure of fit that penalizes the model for the number of model coefficients; therefore, we always prefer the model with the minimum AIC value.
2. Confusion Matrix – the tabular representation of actual vs predicted values. It helps us find the accuracy of the model, calculated as:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
3. ROC Curve:
• The ROC curve (Receiver Operating Characteristic curve) evaluates the trade-off between the true positive rate and the false positive rate.
• A threshold of p > 0.5 is commonly used, since we are more concerned about the success of the model.
• The higher the area under the curve, the better the predictive power of the model.
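A small numeric sketch using cell counts consistent with the slide-61 totals; the exact TP/TN/FP/FN split is assumed here for illustration:

tp, tn, fp, fn = 100, 50, 10, 5     # assumed split; the margins match slide 61 (165 cases)

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)        # true positive rate / recall
specificity = tn / (tn + fp)        # true negative rate
print(f"accuracy={accuracy:.2f}, sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")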
63. Logistic Regression Assumptions
• Logistic Regression does not need a linear relationship between the dependent and independent variables.
• The errors (residuals) need not be normally distributed.
• There should be little to no multicollinearity among the independent variables.
• The outcome is a binary variable, like yes or no, 1 or 0, positive or negative, etc.
• For a binary regression, factor level 1 of the dependent variable should represent the desired outcome.
• There is a linear relationship between the logit of the outcome and each predictor variable. Recall that the logit function is logit(p) = log(p/(1-p)), where p is the probability of the outcome.
• Logistic Regression requires quite large sample sizes.
65. Forward Selection
1. It is a process which begins with an empty model and keeps adding variables one by one.
2. It begins with the intercept only. Tests are performed to find the most relevant 'best' variables.
3. The best variable is the one that returns the highest coefficient of determination (R-Squared value).
4. This process continues, and once adding more variables no longer improves the model, the process stops.
5. Several criteria are used to determine which variable goes in – lowest RMSE on cross-validation, F-test score, or lowest p-value. A sketch of p-value-based selection follows below.
Tip: while using forward selection, to test the accuracy of the model, it is better to use the trained model against held-out test data to make predictions.
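A sketch of p-value-based forward selection with statsmodels OLS; the function and the 0.05 threshold are illustrative, not a standard library routine:

import pandas as pd
import statsmodels.api as sm

def forward_selection(X, y, threshold=0.05):
    # Start with an empty model; at each round add the candidate predictor
    # with the lowest p-value, stopping when no candidate is significant.
    selected, remaining = [], list(X.columns)
    while remaining:
        pvals = {}
        for col in remaining:
            fit = sm.OLS(y, sm.add_constant(X[selected + [col]])).fit()
            pvals[col] = fit.pvalues[col]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= threshold:
            break                   # adding more variables no longer helps
        selected.append(best)
        remaining.remove(best)
    return selected

Calling forward_selection(X_train, y_train) returns the ordered list of chosen predictors; per the tip above, the final model should then be evaluated on held-out test data.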
66. Backward Selection
1. It is a process which begins with all variables and keeps removing predictors one by one.
2. At each step, the variable with the largest p-value – the least significant variable – is removed.
3. The new (p-1)-variable model, with the largest p-value removed, is a better model.
4. This process continues, and once every remaining variable has a significant p-value, we may stop.
5. Several criteria are used to determine which variable goes out – RMSE on cross-validation, F-test score, or the p-value.
67. Let's review some concepts
• Linear Regression
• Assumptions of Linear Regression
• Difference Between Linear and Logistic Regression
• Logistic Regression
• Diagnostic Plots
• Forward and Backward Elimination
68. Thanks!
Any questions?
You can find me on LinkedIn
@mkschauhan
mukul.mschauhan@gmail.com
https://www.linkedin.com/pulse/data-visualisation-using-seaborn-mukul-kr-singh-chauhan
https://www.linkedin.com/pulse/introduction-ggplot-series-mukul-kr-singh-chauhan/
https://www.linkedin.com/pulse/what-data-science-mukul-chauhan/