Linear regression
Nilanjana Pradhan
Case study 1
 Problem Statement:
 A digital media company (similar to Voot, Hotstar,
Netflix, etc.) had launched a show. Initially, the show
got a good response, but then witnessed a decline in
viewership. The company wants to figure out what
went wrong.
 Approach:
 We want to determine the driver variables for show
viewership. This is a case of prediction rather than
projection: we are more interested in identifying the
key driver variables and their impact than in
forecasting the results.
 First, we list the potential reasons for the
decline in viewership.
 The potential reasons could be:
 Decline in the number of people coming to the
platform
 Fewer people watching the video
 A decrease in marketing spend
 Competing shows, e.g. cricket/IPL
 Special holidays
 Twist in the story
Data
 We have been given data for the period of 1 March 2017
to 19 May 2017.
The columns are:
 Views_show: Number of times the show was viewed
 Visitors: Number of visitors who browsed the platform,
but did not necessarily watch a video
 Views_platform: Number of times a video was viewed on
the platform
 Ad_impression: Proxy for marketing budget. Represents
the number of impressions generated by ads
 Cricket_match_india: Whether a cricket match was being
played. 1 indicates a match on a given day, 0 indicates
there wasn't one
 Character_A: Presence of Character A. 1 indicates
Character A was in the episode, 0 indicates she/he
wasn't
 QUESTIONS:
 DEPENDENT VARIABLES?
 INDEPENDENT VARIABLES?
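One possible reading of the question, assuming Views_show is the dependent variable and the remaining columns are candidate independent variables, can be sketched as a first look at how strongly each candidate moves with show views. The rows below are made up purely for illustration (the real dataset is not reproduced here), and the helper name `pearson` is hypothetical.

```python
from math import sqrt

# Made-up rows for illustration only; these are NOT the case-study data.
rows = [
    {"Views_show": 180, "Visitors": 1200, "Views_platform": 900, "Ad_impression": 500, "Cricket_match_india": 0, "Character_A": 1},
    {"Views_show": 150, "Visitors": 1100, "Views_platform": 820, "Ad_impression": 450, "Cricket_match_india": 1, "Character_A": 0},
    {"Views_show": 210, "Visitors": 1350, "Views_platform": 990, "Ad_impression": 560, "Cricket_match_india": 0, "Character_A": 1},
    {"Views_show": 120, "Visitors": 1000, "Views_platform": 760, "Ad_impression": 400, "Cricket_match_india": 1, "Character_A": 0},
    {"Views_show": 170, "Visitors": 1250, "Views_platform": 880, "Ad_impression": 480, "Cricket_match_india": 0, "Character_A": 0},
    {"Views_show": 200, "Visitors": 1300, "Views_platform": 950, "Ad_impression": 540, "Cricket_match_india": 0, "Character_A": 1},
]

# Assumption: Views_show is the dependent variable, the rest are candidates.
TARGET = "Views_show"
DRIVERS = ["Visitors", "Views_platform", "Ad_impression",
           "Cricket_match_india", "Character_A"]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


target = [row[TARGET] for row in rows]
correlations = {d: pearson([row[d] for row in rows], target) for d in DRIVERS}

for name, r in correlations.items():
    print(f"{name}: {r:+.2f}")
```

Correlation is only a screening step; a regression on all candidates together is what separates their individual effects.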
Case study 2
 Suppose you are an HR professional and want to
determine:
 Whether the age of an employee has a substantial
effect on their maturity
 The importance of experience and capability for
remuneration
 The importance of IQ (Intelligence Quotient) vs. EQ
(Emotional Quotient) for problem-handling
capability
 How a sedentary lifestyle at the workplace affects
employee output
 Whether a specific physical activity makes employees
more energetic and lively at the workplace
 All these are routine scenarios in an organization.
But their impact is huge. How, as an HR
professional, can you determine which variables
have what impact on employee productivity?
 Regression analysis offers you the answer. It helps
you explain the relationship between two or more
variables.
 DEPENDENT VARIABLES?
 INDEPENDENT VARIABLES?
 The white line connecting all the dots in the graph
above represents the error of prediction. The aim is
to find the best-fitted line of regression, that is,
the line that minimizes the error of prediction.
 The linear regression model is used when there is a
linear relationship between dependent and
independent variables. When the value of a
dependent variable is based on multiple variables
(more than one), we use multiple regression
analysis.
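A minimal sketch of what "best-fitted line" can mean, assuming the usual least squares criterion (the sum of squared prediction errors), which the slides do not name explicitly. The data points and the function names `fit_line` and `sum_squared_error` are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_error(xs, ys, slope, intercept):
    """Sum of squared differences between observed and predicted values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Made-up observations that roughly follow a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
best_error = sum_squared_error(xs, ys, slope, intercept)

# Any other line, here one with a deliberately altered slope, has a larger error.
other_error = sum_squared_error(xs, ys, slope + 0.5, intercept)

print(f"fitted line: y = {slope:.3f} * x + {intercept:.3f}")
print(f"error of fitted line: {best_error:.4f}")
print(f"error of altered line: {other_error:.4f}")
```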
Regression Analysis in Machine learning
Regression analysis is a statistical method to model
the relationship between a dependent (target)
variable and one or more independent (predictor)
variables.
 More specifically, regression analysis helps us
understand how the value of the dependent variable
changes in response to one independent variable
when the other independent variables are held
fixed.
 It predicts continuous/real values such
as temperature, age, salary, price, etc.
 We can understand the concept of regression
analysis using the example below:
 Example: Suppose a marketing company A runs
various advertisements every year and earns sales
from them.
The list below shows the advertising spend of the company over the last 5 years and the
corresponding sales:
 Now, the company wants to spend $200 on
advertising in 2019 and wants to predict the sales
for that year.
 To solve this type of prediction problem in
machine learning, we need regression analysis.
 Regression is a supervised learning technique that
helps in finding the relationship between variables
and enables us to predict a continuous output
variable based on one or more predictor
variables.
 It is mainly used for prediction, forecasting,
time series modeling, and exploring
cause-and-effect relationships between variables.
 In regression, we fit a line or curve that best fits
the given data points. Using this fit, the machine
learning model can make predictions about the
data.
 Some examples of regression are:
 Prediction of rain using temperature and other
factors
 Determining Market trends
 Prediction of road accidents due to rash driving.
Terminology related to regression analysis
 Dependent Variable: The main factor in regression
analysis that we want to predict or understand is called
the dependent variable. It is also called the target variable.
 Independent Variable: The factors that affect the
dependent variable, or that are used to predict its
values, are called independent variables, also known as
predictors.
 Outliers: An outlier is an observation with a very low
or very high value in comparison to the other observed
values. An outlier may distort the result, so it should be
identified and handled with care.
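To illustrate how a single outlier can distort a fitted line, here is a small sketch. It assumes the 1.5 x IQR rule as the outlier criterion, which is one common convention and not something the slides prescribe; the data and the helper `slope_of` are made up for illustration.

```python
from statistics import quantiles


def slope_of(xs, ys):
    """Least squares slope of the line fitted to the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Made-up data: y = 2 * x everywhere except the last, extreme observation.
xs = list(range(1, 10))
ys = [2, 4, 6, 8, 10, 12, 14, 16, 100]

# Assumed criterion: flag values more than 1.5 * IQR outside the quartiles.
q1, _, q3 = quantiles(ys, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

flagged = [y for y in ys if y < low or y > high]
kept = [(x, y) for x, y in zip(xs, ys) if low <= y <= high]

slope_all = slope_of(xs, ys)
slope_kept = slope_of([x for x, _ in kept], [y for _, y in kept])

print(f"flagged as outliers: {flagged}")
print(f"slope with outlier: {slope_all:.2f}")
print(f"slope without outlier: {slope_kept:.2f}")
```

Whether a flagged point should be removed, corrected, or kept depends on why it is extreme; the sketch only shows how much it can pull the line.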
Why do we use Regression Analysis?
 As mentioned above, regression analysis helps in the
prediction of a continuous variable.
 There are various real-world scenarios where we
need future predictions, such as weather
conditions, sales, and marketing trends. In such
cases we need a technique that can make
predictions accurately.
 Regression analysis is such a technique: a
statistical method used in machine learning
and data science. Below are some other reasons for
using regression analysis:
 Regression estimates the relationship between the
target and the independent variables.
 It is used to find trends in data.
 It helps to predict real/continuous values.
 By performing regression, we can estimate which
factor is the most important, which is the least
important, and how each factor affects the
others.
Types of Regression
 Various types of regression are used in data
science and machine learning. Each type suits
different scenarios, but at the core, all regression
methods analyze the effect of the independent
variables on the dependent variable.
Some important types of regression are given
below:
 Linear Regression
 Logistic Regression (despite its name, mainly used for classification)
 Support Vector Regression
 Decision Tree Regression
 Random Forest Regression
Linear Regression
 Linear regression is a statistical regression method
used for predictive analysis.
 It is one of the simplest regression algorithms and
shows the relationship between continuous
variables.
 It is used for solving regression problems in
machine learning.
 Linear regression models a linear relationship
between the independent variable (X-axis) and the
dependent variable (Y-axis), hence the name linear
regression.
 If there is only one input variable (x), it is called
simple linear regression. If there is more than one
input variable, it is called multiple linear
regression.
 The relationship between variables in the linear
regression model can be explained using the image
below. Here we are predicting the salary of an
employee on the basis of years of experience.
 Below is the mathematical equation for linear
regression:
 Y = aX + b
 Here, Y = dependent variable (target variable),
X = independent variable (predictor variable),
a and b are the linear coefficients: a is the slope
of the line and b is the intercept
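One common way to choose a and b, assuming the ordinary least squares criterion (the slides do not name a fitting method), is to minimize the sum of squared errors over n observed pairs. The symbols n, x_i, y_i and the means \bar{x}, \bar{y} are introduced here only for illustration.

```latex
\begin{align*}
\min_{a,\,b} \; & \sum_{i=1}^{n} \bigl(y_i - (a x_i + b)\bigr)^2 \\
a &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - a\,\bar{x}
\end{align*}
```

Under this criterion the fitted line always passes through the point of means (\bar{x}, \bar{y}).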
Some popular applications of linear regression are:
 Analyzing trends and sales estimates
 Salary forecasting
 Real estate prediction
 Arriving at ETAs in traffic.
Using Regression Analysis to Drive
Ecommerce Sales
 Have you ever wondered what drives your sales?
 Ecommerce businesses typically know the source of
their revenue.
 We can drill down to find specific sales drivers:
 Using regression analysis, a business can determine
subtle causes, such as:
 The social media channel that impacts sales the most.
 The amount that sales should increase after a bump
in marketing spend.
 Whether free shipping or discounts contribute more
to sales.
 Whether one product category should be marketed
aggressively.
Dependent variable: Sales
Independent variables: promotion in social media, free
shipping, discounts
Regression Model
 Businesses use regression models to understand how
changes in a set of independent variables affect a
dependent one.
 For ecommerce businesses, the dependent variable is
often sales. It can also be conversion rates.
 The independent variables could be:
 email sends
 expenditures on social media and search engine
optimization
The regression model lets business owners measure,
one at a time, each independent variable’s impact on
sales.
 In other words, a regression model can predict, say,
how much a 20 percent increase in Facebook ad
spend will increase sales.
 It can use past sales and, perhaps, weather data by
date to predict how a coming storm will slow or
speed sales.
 It can also give you an idea of the increase or
decrease in sales resulting from additional email
sends — a decrease would indicate subscriber
annoyance.
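The idea of measuring each variable's impact separately, and of translating a 20 percent increase in ad spend into a sales change, can be sketched with a small multiple regression. Everything here is an assumption for illustration: the weekly figures are made up, the sales are generated from a known rule so the fit can be checked, and the helpers `solve` and `fit_multiple` are hypothetical names for a least squares fit via the normal equations.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_multiple(rows, y):
    """Least squares fit of y = b0 + b1*x1 + b2*x2 + ... (normal equations)."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 gives the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Made-up weekly data: (email sends, Facebook ad spend).
features = [(10, 100), (12, 150), (8, 120), (15, 200), (11, 90), (9, 160)]
# Sales generated from an assumed rule: 50 + 3 * emails + 2 * ad spend.
sales = [50 + 3 * emails + 2 * ads for emails, ads in features]

intercept, per_email, per_ad_unit = fit_multiple(features, sales)

# Effect of a 20 percent increase in ad spend, with email sends held fixed.
current_ad_spend = 150
extra_sales = per_ad_unit * 0.20 * current_ad_spend

print(f"sales change per extra email send: {per_email:.2f}")
print(f"sales change per extra unit of ad spend: {per_ad_unit:.2f}")
print(f"expected extra sales from +20% ad spend: {extra_sales:.2f}")
```

Each coefficient is read as the change in sales for a one-unit change in that variable while the other is held fixed; with real data the fit would not be exact as it is in this constructed example.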
An ecommerce owner needs only historical sales and cost of SEO to predict how SEO spend
impacts revenue, as depicted on this chart.
 A simple regression formula could be:
 Y = A + B(X)
 Y is the dependent variable, such as sales or email
signups.
 X is the value of the independent variable, such as
Facebook ad spend or email frequency.
 B is a constant that reflects how much Y changes for
every one-unit increase in X. (Getting an accurate
number may require a mathematician or an app.)
 A is a constant that equals the value of Y when X is
zero. Determine A by plugging 0 into X.
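The roles of A and B can be checked directly from the formula. The short derivation below is only a worked restatement of Y = A + B(X); the lowercase x, standing for any particular value of X, is introduced here for illustration.

```latex
\begin{align*}
Y(0) &= A + B \cdot 0 = A \\
Y(x + 1) - Y(x) &= \bigl(A + B(x + 1)\bigr) - \bigl(A + Bx\bigr) = B
\end{align*}
```

So A is the value of Y at X = 0, and B is the change in Y for each one-unit increase in X, whatever the starting value of X.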
