This document provides an overview of data analytics including:
- The basics of data analytics, including definitions of analytics and the need for it driven by growing data volumes.
- Descriptions of different types of analytics including descriptive, diagnostic, predictive, and prescriptive analytics and their purposes.
- An overview of the data analytics lifecycle including phases such as data preparation, model planning, model building, and communication of results.
1. The document discusses various advanced data analytics techniques, including data mining, online analytical processing (OLAP), pivot tables, Power Pivot and Power View in Excel, and different types of data mining techniques such as classification, clustering, regression, association rules, outlier detection, sequential patterns, and prediction.
2. It provides details on each technique, including definitions, applications, and examples.
3. In short, the key techniques covered are data mining, OLAP, the Excel tools pivot tables, Power Pivot, and Power View, and various classification methods for advanced data analysis.
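Of the mining techniques listed above, outlier detection is the easiest to illustrate concretely. The following is only a minimal sketch with invented sensor readings, flagging values whose z-score exceeds a threshold; it is not code from the original slides:

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=2.0):
    """Flag values whose z-score exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

readings = [10, 11, 9, 10, 12, 10, 11, 48]  # 48 is the anomaly
print(zscore_outliers(readings))  # → [48]
```

The same idea scales up: replace the z-score with a distance-to-cluster-centroid or model-residual measure, and the loop becomes a basic outlier-detection pass over a dataset.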
It is an introduction to data analytics, its applications in different domains, the stages of an analytics project, and the phases of the data analytics life cycle.
I gratefully acknowledge the sources from which this material was consolidated.
Business analysts can use data mining tools and techniques to improve business processes and audit capabilities. Data mining involves exploring large datasets to identify patterns and relationships between variables. It is a three-stage process involving exploration, model building/validation, and deployment. The goal is prediction, allowing organizations to make proactive decisions. Business analysts should understand data mining to reduce risks, improve efficiency, and provide recommendations that enhance operations.
Predictive modeling is a process used in predictive analytics to create statistical models that can forecast future outcomes based on historical data. Predictive modeling uses techniques from data mining, statistics, and machine learning to analyze current data to make predictions. The predictive modeling process involves collecting data, creating a model, testing and validating the model, and evaluating the model's performance. Predictive models are commonly used to predict customer behavior, risk levels, product performance, and more. Industries like retail, healthcare, finance, and telecommunications frequently use predictive modeling techniques.
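The collect → build → validate → evaluate loop described above can be sketched in a few lines. This is an illustrative sketch only, using invented numbers and a deliberately simple one-variable least-squares model rather than any technique named in the original document:

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    mx, my = mean(xs), mean(ys)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Collect: historical data, e.g. ad spend vs. sales (hypothetical numbers)
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3, 5, 7, 9, 11, 13, 15, 17]  # exactly y = 2x + 1, for clarity

# Build: fit the model on the first six observations only
a, b = fit_line(xs[:6], ys[:6])

# Validate/evaluate: mean absolute error on the held-out last two points
mae = mean(abs((a * x + b) - y) for x, y in zip(xs[6:], ys[6:]))
print(a, b, mae)  # → 2.0 1.0 0.0
```

Real predictive modeling swaps in richer models and much larger datasets, but the shape of the process (hold out data, fit, score the held-out data) is the same.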
Modak Analytics provides predictive modeling solutions to help companies analyze customer data and make reliable decisions. Predictive modeling involves [1] analyzing accumulated customer data to derive useful insights, [2] designing a predictive model using techniques such as clustering, decision trees, regression, and scorecards, and [3] implementing the model to better understand customers and make profitable decisions. Predictive analysis allows companies to segment markets, rank products, predict customer responses, and reduce fraud. Modak Analytics' customized solutions combine different modeling techniques into ensemble models that draw on the strengths of each.
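The ensemble idea mentioned here, combining several techniques so each contributes its strength, can be sketched as a simple majority vote over base classifiers. The rules, data, and labels below are entirely invented for illustration and are not Modak Analytics' actual method:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine base-model predictions by taking the most common label."""
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical rule-based base models scoring one customer
customer = {"age": 42, "balance": 1200, "visits": 9}
base_models = [
    lambda c: "respond" if c["visits"] > 5 else "ignore",      # behavioral rule
    lambda c: "respond" if c["balance"] > 1000 else "ignore",  # scorecard-style rule
    lambda c: "respond" if c["age"] < 30 else "ignore",        # demographic rule
]

votes = [m(customer) for m in base_models]
print(majority_vote(votes))  # → respond
```

Production ensembles typically combine trained models (trees, regressions, scorecards) rather than hand-written rules, and may weight votes by each model's validated accuracy, but the combining step looks much like this.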
Basic Concepts of Business Data Analytics, Evolution of Business Analytics, Data Analytics, Business Data Analytics Applications, Scope of Business Analytics.
This document discusses predictive analytics and provides an overview of Oracle's predictive analytics tools.
It argues that predictive analytics is commonly misunderstood as only predicting the future, but it can also be used to predict the present from existing data patterns. It proposes a new conceptual classification of predictive analytics into "predicting the present" and "shaping the future". The document then gives examples of how Oracle Data Mining can be used on present conditions, such as inferring customer preferences, detecting fraud, and scoring credit. It also discusses how Oracle Real-Time Decisions integrates predictive analytics into real-time processes.
The document provides advice on successfully managing predictive analytics programs. It discusses the importance of having an open organizational mindset that embraces new ideas and change. It also emphasizes having a clear business strategy and objectives when developing predictive models. Regularly testing and updating models is key to ensuring optimal predictive accuracy over time as business needs and available data evolve.
This document provides an overview of predictive analytics. It discusses what predictive analytics is and how it is used by organizations to make smarter decisions about customers. Predictive analytics uses historical data and statistical techniques to predict future outcomes and automate decisions. Examples are given of how predictive analytics has helped industries like financial services, insurance, telecommunications, retail, and healthcare improve customer decisions and outcomes.
The document summarizes a Forrester research report on big data predictive analytics solutions. It finds that vendors must address the challenges of big data predictive analytics to help firms harness big data for predictive models to improve business outcomes. The market for big data predictive analytics is growing as more organizations seek to use these solutions. Key differentiators among vendors include their abilities to handle big data, provide easy-to-use modeling tools, and support a wide range of algorithms for structured and unstructured data.
Predictive project analytics: Will your project be successful? (Deloitte Canada)
We may not often ask ourselves whether our project will succeed for fear of the answer. But 63 percent of projects either fail or struggle to meet their budget or completion objectives. The more complex the project, the more likely it is to fail. A recent, high-profile example of this was the roll-out of the U.S. government’s healthcare.gov program. While the government acted quickly to fix major problems with the website, the glitch led many Americans to delay their decision to join the program and turned many others off altogether. Several factors contributed to the website’s failure, including incorrectly forecasting the performance requirements, not giving sufficient time for appropriate testing and underestimating the complexity of the project. The same shortcomings doom other projects, too.
To avoid making similar mistakes, leading organizations need to identify in advance which projects are more likely to end badly and how to give them the best shot at success. Predictive project analytics, or PPA, is a new approach that leverages advanced analytics to evaluate a given project’s likelihood of success. Read how it works and how it can help your organization.
Machine Learning for Business - Eight Best Practices for Getting Started (Bhupesh Chaurasia)
This document provides an overview of best practices for organizations getting started with machine learning. It discusses 8 best practices: 1) Learn the predictive thought process, 2) Focus on specific use cases, 3) Look for the right predictive tooling, 4) Get training on machine learning techniques, 5) Remember that good quality data is important, 6) Establish model governance processes, 7) Put machine learning models into action, and 8) Manage, monitor and optimize models continuously. The document provides details and examples for each best practice to help organizations successfully implement machine learning.
Business analytics uses statistical methods and technologies to analyze historical data and gain new insights to improve strategic decision-making. It refers to skills, technologies, and practices for continuously developing new understandings of business performance based on data analysis. Business analytics is commonly used to analyze various data sources, find patterns within datasets to predict trends and access new consumer insights, monitor key performance indicators in real-time, and support decisions with current information. It provides companies the ability to interpret large volumes of data to make informed decisions supporting organizational growth.
The document discusses how companies that are leading in analytics use data and analytics to gain competitive advantages and innovate. It profiles "Analytical Innovators" - companies that rely on analytics to compete and innovate. These companies share a belief that data is a core asset, make effective use of more data for faster results, and have senior management support for data-driven decision making. The document provides examples of companies in different industries that are successfully using analytics and a framework for other companies to also become more analytical.
This document provides an introduction to data analytics. It defines data, analytics, and data analytics. The main types of analytics described are descriptive, predictive, and prescriptive. Applications of data analytics discussed include self-driving cars, recommendation engines, and decision making. Key activities in data analytics include data extraction, analysis, manipulation, modeling, and visualization. Various roles and careers in data analytics are also outlined, along with example use cases and common tools used.
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic... (Edureka!)
Data Analytics for R Course: https://www.edureka.co/r-for-analytics
This Edureka Tutorial on Data Analytics for Beginners will help you learn the various parameters you need to consider while performing data analysis.
The following are the topics covered in this session:
Introduction To Data Analytics
Statistics
Data Cleaning and Manipulation
Data Visualization
Machine Learning
Roles, Responsibilities and Salary of Data Analyst
Need of R
Hands-On
Statistics for Data Science: https://youtu.be/oT87O0VQRi8
The document discusses data analytics and its evolution from relying on past experiences to using data-driven insights. It covers the types of analytics including descriptive, diagnostic, predictive, and prescriptive analytics. Descriptive analytics summarize past data, diagnostic analytics determine factors influencing outcomes, predictive analytics make future predictions, and prescriptive analytics identify best courses of action. The document also discusses data analysis tools, natural language processing, applications of analytics, benefits of analytics for IoT, and issues with big data in IoT contexts like smart agriculture.
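The split between descriptive analytics (summarizing past data) and predictive analytics (forecasting) described above can be made concrete in a few lines. The sales figures and the naive moving-average forecast below are invented for illustration:

```python
from statistics import mean

monthly_sales = [100, 110, 105, 120, 130, 125]  # hypothetical history

# Descriptive: summarize what already happened
summary = {"total": sum(monthly_sales), "average": mean(monthly_sales)}

# Predictive: a naive forecast of next month from the last three months
forecast = mean(monthly_sales[-3:])

print(summary["total"], forecast)  # → 690 125
```

Diagnostic analytics would ask *why* sales moved (e.g., correlating them with promotions), and prescriptive analytics would recommend an action given the forecast; both build on the two steps sketched here.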
A complete brief introduction and importance on Data Science, Data Analytics, Business Analytics, Tools used for Analytics, Artificial Intelligence and Machine Learning.
Customer Intelligence & Analytics - Part I (Vivastream)
This document discusses how the field of marketing analytics is evolving due to the explosion of available data and increased use of analytics. It describes how companies now use analytics throughout their operations to gain insights from data and make better decisions. The document also outlines some common areas where companies apply analytics, such as customer acquisition, pricing, and forecasting. It cautions that analytics must be implemented carefully and should be guided by the data rather than preconceptions to avoid bias.
Business analytics is the practice of transforming data into business insights, enabling end users to make better decisions. Using modern tools and techniques, business analytics can help assess complex situations, consider all the available options, predict outcomes, and highlight critical risks for decision makers.
Business analytics can simply be described as a practice that uses techniques such as data warehousing, data mining, and programming to visualize and discover patterns or trends in data. In simple terms, analytics helps convert data into useful information that can be used for decision-making. As a means of sorting through data to find useful information, the application of analytics has found new purpose.
The document provides an overview of business analytics (BA) including its history, types, examples, challenges, and relationship to data mining. BA involves exploring past business performance data to gain insights and guide planning. It can focus on specific business segments. Types of BA include descriptive analytics like reporting, affinity grouping, and clustering, as well as predictive analytics. Challenges to BA include acquiring high quality data and rapidly processing large volumes of data. Data mining is an important task within BA that helps handle large datasets and specific problems.
This document discusses data analytics and related concepts. It defines data and information, explaining that data becomes information when it is organized and analyzed to be useful. It then discusses how data is everywhere and the value of data analysis skills. The rest of the document outlines the methodology of data analytics, including data collection, management, cleaning, exploratory analysis, modeling, mining, and visualization. It provides examples of how data analytics is used in healthcare and travel to optimize processes and customer experiences.
41% of data scientists work in technology and 13% work in marketing. Analytics involves analyzing data to find deeper insights and relationships using data, information technology, statistics, quantitative methods, and models. Various machine learning techniques and analytics applications are described such as market basket analysis, exploratory data analysis, cluster analysis, logistic regression, and predicting earthquakes or spam. Pattern recognition techniques like neural networks and support vector machines can be applied to problems in medical diagnosis, image processing, and other domains.
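Market basket analysis, mentioned above, ultimately reduces to counting co-occurrences across transactions. The following is a minimal sketch computing support and confidence for a single association rule; the baskets are made up for illustration:

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support and confidence for the rule: antecedent -> consequent."""
    n = len(transactions)
    has_a = [t for t in transactions if antecedent in t]
    has_both = [t for t in has_a if consequent in t]
    support = len(has_both) / n          # how often both items co-occur
    confidence = len(has_both) / len(has_a)  # co-occurrence given antecedent
    return support, confidence

baskets = [
    {"bread", "milk"}, {"bread", "butter"}, {"milk", "butter"},
    {"bread", "milk", "butter"}, {"bread", "milk"},
]
s, c = rule_metrics(baskets, "bread", "milk")
print(s, c)  # → 0.6 0.75
```

Full market basket algorithms such as Apriori enumerate candidate rules efficiently rather than checking one rule at a time, but each rule is still scored with exactly these two ratios.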
This document discusses how data and advanced analytics are transforming businesses. It notes that $1.6 trillion in value could be created for businesses that embrace data over the next four years. It then provides overviews of different types of analytics (descriptive, diagnostic, predictive, prescriptive) and how analytics are being applied in areas like the Internet of Things, machine learning, and establishing effective data science practices. Machine learning applications discussed include hospital readmissions, stock price prediction, and fraud detection. The document emphasizes that Azure ML can help streamline the challenging data science process by providing tools for collaboration, scaling, and easy model deployment.
The document discusses predictive analytics techniques including data preparation, modeling, and model monitoring. It describes preparing data through transformation, deriving behavioral variables, and quality checks. Modeling techniques covered include decision trees, regression, neural networks, and ensemble modeling in SAS Enterprise Miner or other software. Model monitoring compares actual and predicted values, analyzes variable distributions in scored data, and monitors model performance metrics.
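The monitoring step described here, comparing actual and predicted values on newly scored data, can be sketched as computing an error metric against a tolerance. The numbers and the retraining threshold below are illustrative, not taken from the original document:

```python
from statistics import mean

def monitor(actual, predicted, mae_limit):
    """Flag a model for retraining when its error drifts past a limit."""
    mae = mean(abs(a - p) for a, p in zip(actual, predicted))
    return mae, mae > mae_limit

# Hypothetical actual outcomes vs. the model's scored predictions
actual = [100, 120, 90, 110]
predicted = [98, 125, 88, 140]
mae, needs_retrain = monitor(actual, predicted, mae_limit=5.0)
print(mae, needs_retrain)  # → 9.75 True
```

In practice the same loop also tracks the distributions of input variables in the scored data, so that drift can be caught even before ground-truth outcomes arrive.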
Prasad Narasimhan discusses various applications of predictive analytics across different domains including business, marketing, operations, collections, customer segmentation, telecom, sports, social media, and insurance. Predictive analytics uses statistical techniques to analyze current and historical data to predict future events or outcomes. It has various uses such as predicting customer churn, credit risk, response to marketing campaigns, fraud detection, and more. The document provides examples of how predictive analytics is applied in areas like customer retention, cross-sell, collections, credit risk management, and churn prediction in telecom.
The document discusses data science and data analytics. It provides definitions of data science, noting it emerged as a discipline to provide insights from large data volumes. It also defines data analytics as the process of analyzing datasets to find insights using algorithms and statistics. Additionally, it discusses components of data science including preprocessing, data modeling, and visualization. It provides examples of data science applications in various domains like personalization, pricing, fraud detection, and smart grids.
what is ..how to process types and methods involved in data analysis (Data analysis ireland)
Data analysis is the process of cleaning, transforming, and processing raw data in order to extract useful and actionable information that can assist businesses in making better decisions.
Data science and data analytics professionals enable organizations to tap the potential of predictive analytics to make informed decisions and help advance the organization's analytics maturity model.
2. Basics of Data Analytics:
Analytics:
i) It is the systematic computational analysis of data.
ii) It is the discovery, interpretation, and communication of meaningful patterns in data.
iii) It relies on the simultaneous application of statistics, computer programming, and operations research to quantify performance.
Data Analytics: It is the science of examining raw data with the purpose of drawing conclusions.
Data Analytics can also be defined as the process of inspecting, cleansing, transforming, and modelling data with the goal of discovering useful information, informing conclusions, and supporting decision making.
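The inspect, cleanse, transform, and model steps in the definition above can be sketched in a few lines of standard-library Python. The sales figures below are made-up illustrative data, not from any real dataset.

```python
# Minimal sketch of inspect -> cleanse -> transform -> summarize,
# using only the standard library; the input values are invented.
from statistics import mean

raw = ["120", "135", None, "98", "abc", "142"]   # raw data with noise

# Cleansing: drop missing or non-numeric records
cleaned = []
for value in raw:
    try:
        cleaned.append(float(value))
    except (TypeError, ValueError):
        continue

# Transforming: normalise values to a 0-1 range
lo, hi = min(cleaned), max(cleaned)
scaled = [(v - lo) / (hi - lo) for v in cleaned]

# Summarizing: a simple aggregate that can inform a conclusion
print(f"{len(cleaned)} valid records, mean = {mean(cleaned):.1f}")
```

Even this toy pipeline shows why the cleansing step matters: two of the six raw records would have broken any downstream computation.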
6. Need of Data Analytics:
Data and information are increasing rapidly, so the amount of information that will be available to us in the future is hard to predict.
It is crucial to integrate this data; if it is wasted, a lot of valuable information will be lost.
Previously, skilled analysts were required to process the data, but these days processing such massive amounts of data is not possible for a human being.
So there is a need for tools that operate on this data with high speed and efficiency and help the business make better decisions.
So, Data Analytics is important.
7. What is Data Analytics?
It applies quantitative and qualitative techniques.
It is the science of drawing insights from raw information sources.
It encompasses many diverse types of data analysis.
It is primarily conducted in business-to-consumer (B2C) applications.
13. Overview of the Data Analytics Lifecycle
Data Analytics is the science of examining raw data with the purpose of drawing conclusions about the information.
There are 6 phases in the lifecycle of data analytics:
1. Discovery:
i) The team learns the business domain.
ii) The team assesses the resources available to support the project in terms of people, technology, time, and data.
iii) The team frames the business problem as an analytics challenge that can be addressed in subsequent phases, formulates initial hypotheses to test, and begins learning the initial data.
2. Data Preparation:
i) Here the team requires an analytical sandbox, in which the team works with the data and performs analytics for the project.
ii) The team needs to execute an Extract, Transform, Load, Transform (ETLT) process. Data should be transformed in the ETLT process so the team can work with it and analyze it.
iii) It includes the steps to explore, preprocess, and condition data prior to modeling and analytics.
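As a hedged illustration of the data preparation step, the sketch below lands raw rows in an in-memory SQLite "sandbox" first and then transforms them inside it, in the spirit of the ETLT idea above. The table and column names, and the CSV sample, are invented for illustration.

```python
# Extract raw rows, Load them into an analytical sandbox, then
# Transform inside the sandbox (type-cast, drop bad records).
import csv
import io
import sqlite3

# Extract: read raw CSV (simulated here with an in-memory string)
raw_csv = "customer,amount\nalice,10.5\nbob,\ncarol,7.25\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Load: land the untouched rows in the sandbox first
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE raw_orders (customer TEXT, amount TEXT)")
db.executemany("INSERT INTO raw_orders VALUES (?, ?)",
               [(r["customer"], r["amount"]) for r in rows])

# Transform: cast to numeric and filter out empty amounts
db.execute("""CREATE TABLE orders AS
              SELECT customer, CAST(amount AS REAL) AS amount
              FROM raw_orders WHERE amount <> ''""")
total, = db.execute("SELECT SUM(amount) FROM orders").fetchone()
print(total)   # 17.75
```

Keeping the raw table alongside the transformed one is what lets the team re-run or adjust the transformation without re-extracting from the source.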
14. 3. Model Planning:
i) Here the team determines the methods, techniques, and workflow it intends to follow for the subsequent model building phase.
ii) The team explores the data to learn about the relationships between variables.
4. Model Building:
i) In this phase, the team develops datasets for testing, training, and production purposes.
ii) The team executes the model based on the work done in the model planning phase.
iii) The team finds out whether existing tools will be sufficient for running the model or whether it will need a more robust environment for executing models and workflows.
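The testing/training split mentioned in the model building phase can be shown with a minimal from-scratch example: hold out part of the data, fit a one-variable least-squares line on the rest, and check the error on the held-out points. The numbers are synthetic.

```python
# Holdout split plus ordinary least squares, standard library only.
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.1), (5, 9.8), (6, 12.1)]
train, test = data[:4], data[4:]          # simple holdout split

# Fit y = a*x + b on the training portion
n = len(train)
sx = sum(x for x, _ in train)
sy = sum(y for _, y in train)
sxx = sum(x * x for x, _ in train)
sxy = sum(x * y for x, y in train)
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n

# Evaluate on the held-out test set
errors = [abs((a * x + b) - y) for x, y in test]
print(f"slope={a:.2f}, max test error={max(errors):.2f}")
```

In practice the team would use a proper modeling environment, but the principle is the same: the model never sees the test rows while it is being fit.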
5. Communicate Results:
i) Here the team, in collaboration with stakeholders, determines whether the results of the project are a success or a failure based on the criteria developed in phase 1.
ii) The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey the findings to stakeholders.
6. Operationalize:
i) The team delivers the final report, briefings, code, and technical documents.
ii) The team may run a pilot project to implement the model in a production environment.
15. Importance of Data Analytics for Business:
1. Improving Efficiency
2. Market Understanding
3. Cost Reduction
4. Faster and Better Decision Making
5. New Products/Services
6. Industry Knowledge
7. Witnessing the Opportunity
16. Difference between Data Science and Data Analytics

Sr. No. | Terms           | Data Science                                          | Data Analytics
1       | Scope           | Macro                                                 | Micro
2       | Focus on        | Providing strategic actionable insights into the world | Providing operational observations into issues
3       | Skills required | Mathematical, technical, and strategic knowledge is necessary | Data analytics and visualization skills are required
4       | Big data        | Deals with big data                                   | Not necessary to deal with big data
5       | Major fields    | Machine learning, AI, search engine engineering, corporate analytics | Healthcare, gaming, travel, industries with immediate data needs
20. What Are Diagnostic Analytics?
Diagnostic analytics are a form of advanced analytics that focus on explaining why something has happened based on data analysis. Like a doctor investigating a patient's symptoms, they aim to understand the underlying issues and determine why an issue is happening.
Their capabilities allow users to identify anomalies by highlighting areas that could require further study, which are pinpointed when trends or data points raise questions that can't be answered easily or without digging deeper. Some questions that would be addressed with diagnostic analytics include:
• Why did this marketing campaign fail?
• Why have sales increased without any increased marketing attention for a certain region?
• Why did employee performance fall during this month?
as well as other questions that have no obvious answer from a single data source.
Diagnostic analytics offer data discovery, drill-down, data mining, and data correlation. Drilling down into the data allows users to identify potential sources for the anomalies discovered in the first step. Analysts can use these capabilities to examine patterns both within and external to the data to draw an informed conclusion. Probability theory, filtering, regression analysis, and time-series data analysis are all useful tools related to diagnostic analytics that facilitate this process.
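A small example of the correlation step in a diagnostic drill-down: given a dip in sales, check how strongly sales track a candidate driver such as marketing spend. The monthly figures below are invented for illustration, and the Pearson coefficient is computed directly from its definition.

```python
# Pearson correlation between a candidate driver and the metric
# being diagnosed; figures are illustrative only.
from math import sqrt
from statistics import mean

spend = [12, 14, 9, 15, 7, 6]          # marketing spend per month
sales = [120, 138, 101, 149, 88, 80]   # sales per month

ms, my = mean(spend), mean(sales)
cov = sum((x - ms) * (y - my) for x, y in zip(spend, sales))
r = cov / sqrt(sum((x - ms) ** 2 for x in spend)
               * sum((y - my) ** 2 for y in sales))
print(f"Pearson r = {r:.2f}")          # a value near +1 suggests the
                                       # sales dip tracks the spend cut
```

Correlation alone does not prove the spend cut caused the dip, which is why diagnostic analytics pair it with drill-down into the underlying records.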
21. What Are Descriptive Analytics?
Descriptive analytics describe or summarize raw data and turn it into something that is interpretable by humans.
A simpler way to define descriptive analytics: they answer the question "What has happened?"
Descriptive analytics are useful because they allow us to learn from past behaviours and understand how they might influence future outcomes.
The main objective of descriptive analytics is to find out the reasons behind previous success or failure.
A common example: descriptive analytics are the reports that provide historical insights regarding the company's production, financials, operations, sales, inventory, and customers.
Most social analytics are descriptive analytics. They summarize certain groupings based on simple counts of events, like the number of followers, likes, and post fans.
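The "simple counts of events" idea above takes only a few lines in standard-library Python. The event log and view figures are made-up sample data.

```python
# Descriptive summary: counts and aggregates answering "what happened?"
from collections import Counter
from statistics import mean, median

events = ["like", "follow", "like", "post", "like", "follow", "post"]
daily_views = [340, 410, 295, 500, 385]

print(Counter(events))                 # counts per event type
print(f"mean views {mean(daily_views):.0f}, median {median(daily_views)}")
```

Note that nothing here estimates anything: descriptive analytics only restate the past in an understandable form.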
22. What Are Predictive Analytics?
Predictive and descriptive analytics have oppositional objectives, but they’re very closely
related. This is because you need accurate information about the past to make predictions
for the future. Predictive tools attempt to fill in gaps in the available data. If descriptive
analytics answer the question, “what happened in the past,” predictive analytics answer the
question, “what might happen in the future?”
Predictive analytics take historical data from various systems and use it to highlight
patterns. Then, algorithms, statistical models and machine learning are employed to
capture the correlations between targeted data sets.
The most common commercial example is a credit score. Banks use historical information
to predict whether or not an applicant is likely to keep up with payments. It works in much
the same way for manufacturers, except that they're usually trying to find out whether
products will sell. Predictive analytics focus on the future of the business.
Predictive analytics can be used throughout the organization, from forecasting customer
behaviour and purchasing patterns to identifying trends in sales activity.
23. What Are Prescriptive Analytics?
Of diagnostic, predictive, descriptive, and prescriptive analytics, the latter is the most
recent addition to the business intelligence landscape. These tools enable companies to
view potential decisions and, based on both current and historical data, follow them
through to a likely outcome. They provide recommendations regarding actions that will take
advantage of the predictions.
Like predictive analytics, prescriptive analytics won’t be right 100% of the time, because
they work with estimates. However, they provide the best way of “seeing into the future”
and determining the viability of decisions before they’re made.
The difference between the two is that prescriptive analytics offers opinions as to why a
particular outcome is likely. They can then offer recommendations based on this
information. To achieve this, they use algorithms, machine learning and computational
modeling.
If predictive analytics answers, “What might happen?” then prescriptive analytics
answers, “What do we have to do to make it happen?” or “How will this action change the
outcome?” Prescriptive deals more with trial and error and has a bit of a hypothesis-testing
nature to it.
24. Summary of the Different Types
Diagnostic analytics ask about the present. They drill down into why something has
happened and help users diagnose issues.
Descriptive analytics ask about the past. They want to know what has been happening to
the business and how this is likely to affect future sales.
Predictive analytics ask about the future. These are concerned with what outcomes can
happen and what outcomes are most likely.
Finally, prescriptive tools ask about the present's impact on the future. They want to know
the best course of action right now in order to positively impact the future. In other
words, they're the decision-makers.
25. Statistical Inference:
Statistical inference is a technique by which you can analyze results and draw
conclusions from given data that are subject to random variation.
Statistics can be classified into two categories: 1. Descriptive Statistics and
2. Inferential Statistics. Descriptive statistics describe the data, whereas inferential
statistics help you make predictions from the data. In inferential statistics, data are
taken from a sample and allow you to generalize to the population. In general,
"inference" means drawing a reasoned conclusion about something.
The purpose of statistics is to describe and predict the information.
The basic principle of Statistical inference is that conclusion about a population of
interest can be made using information contained in a sample from that population.
26. Statistical inference is the procedure through which inferences about a population are made
based on certain characteristics calculated from a sample of data drawn from that
population.
Statistical inference is the process of generating conclusions about a population from a noisy
sample. Without statistical inference we are simply living in data; with statistical inference
we are trying to generate knowledge.
Definition of statistical inference: it is the method of drawing conclusions about a population,
and measuring the reliability of those conclusions, based on information obtained from a
sample of that population.
Statistical inference can be contrasted with exploratory data analysis.
Statistical inference requires navigating a set of assumptions and tools, and subsequently
thinking about how to draw conclusions from data.
Inference emphasizes the role of population quantities of interest, about which we wish
to draw conclusions. Descriptive statistics are used as a preliminary step before formal
inferences are drawn. A descriptive statistic is a summary statistic that quantitatively
describes or summarizes features of a collection of information.
The conclusion of statistical inference is a statistical proposition.
27. There are two broad areas of Statistical inference :
1)statistical estimation
2)Statistical hypothesis testing.
1) Statistical estimation: It is concerned with best estimating the value or range of values for
a particular population parameter. There are two types of statistical estimation:
i) Point estimation: here, we estimate an unknown parameter using a single number that
is calculated from the sample data. In statistics, point estimation involves the use of sample
data to calculate a single value which serves as a "best guess" or "best estimate" of an
unknown population parameter.
ii) Interval estimation: here, we estimate an unknown parameter using an interval of
values that is likely to contain the true value of that parameter.
Interval estimation, in statistics, is the evaluation of a parameter of a population (for
example, the mean) by computing an interval, or range of values, within which the
parameter is most likely to be located.
2) Hypothesis testing: it is concerned with deciding whether the study data are consistent, at
some level of agreement, with a particular value of a population parameter. In hypothesis testing
we begin with a claim about the population (called the null hypothesis) and check whether or not
the data obtained from the sample provide evidence against this claim.
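Both branches above can be sketched with the Python standard library. The sample values, the claimed mean mu0, the assumed-known population standard deviation, and the 5% significance level below are all illustrative assumptions; the normal (z) approximation is used for simplicity, although a t-distribution would be more precise for a sample this small.

```python
# Sketch of point estimation, interval estimation, and a hypothesis test.
from statistics import NormalDist, mean, stdev

sample = [52.1, 53.4, 51.8, 54.0, 52.7, 53.1, 52.5, 53.8]  # illustrative data
n = len(sample)

# 1) Statistical estimation
# i) Point estimation: the sample mean is a single "best guess"
#    for the unknown population mean.
point_estimate = mean(sample)

# ii) Interval estimation: a 95% confidence interval around that point
#     estimate (normal critical value, roughly 1.96).
z_crit = NormalDist().inv_cdf(0.975)
margin = z_crit * stdev(sample) / n ** 0.5
interval = (point_estimate - margin, point_estimate + margin)

# 2) Hypothesis testing: H0 claims the population mean equals mu0.
#    A z-test checks whether the sample provides evidence against H0
#    (population standard deviation assumed known here).
mu0, sigma = 50.0, 4.0
z_stat = (point_estimate - mu0) / (sigma / n ** 0.5)
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))  # two-sided p-value
reject_h0 = p_value < 0.05
```

The point estimate answers "what single number best describes the parameter?", the interval quantifies how reliable that number is, and the test decides whether the data contradict the claimed value.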
28. Population:
In statistics, as in quantitative methodology generally, data sets are collected and selected from a
statistical population with the help of defined procedures. There are two different types of data
sets, namely population and sample. So basically, when we calculate the mean deviation, variance
or standard deviation, it is necessary to know whether we are referring to the entire population or
only to sample data. If the sample size is denoted by n, the sample versions of these formulas
divide by n - 1 rather than by n (Bessel's correction). Let us take a look at population data sets
and sample data sets in detail.
Population : It includes all the elements from the data set and measurable characteristics of the
population such as mean and standard deviation are known as a parameter. For example, All
people living in India indicates the population of India.
There are different types of population. They are:
• Finite Population
• Infinite Population
• Existent Population
• Hypothetical Population
29. Let us discuss all the types one by one.
Finite Population
The finite population is also known as a countable population, in which the population can be counted. In
other words, it is a population consisting of a finite number of individuals or objects. For statistical analysis,
a finite population is more advantageous than an infinite population. Examples of finite populations are the
employees of a company or the potential consumers in a market.
Infinite Population
The infinite population is also known as an uncountable population, in which counting the units of the
population is not possible. An example of an infinite population is the number of germs in a patient's body,
which is uncountable.
Existent Population
The existent population is defined as the population of concrete individuals. In other words, a population
whose units are available in material form is known as an existent population. Examples are books, students, etc.
Hypothetical Population
A population whose units are not available in material form is known as a hypothetical population. A
population consists of sets of observations, objects, etc. that all have something in common. In some
situations, the population is only hypothetical. Examples are the outcomes of rolling a die or of tossing a
coin.
30. Sample
It includes one or more observations that are drawn from the population, and a measurable characteristic of a
sample is called a statistic. Sampling is the process of selecting a sample from the population.
For example, some of the people living in India are a sample of the population.
Basically, there are two types of sampling. They are:
•Probability sampling
•Non-probability sampling
Probability Sampling
In probability sampling, the population units cannot be selected at the discretion of the researcher. Selection
follows defined procedures which ensure that every unit of the population has a fixed probability of being
included in the sample. Such a method is also called random sampling. Some of the techniques used for
probability sampling are:
•Simple random sampling
•Cluster sampling
•Stratified Sampling
•Disproportionate sampling
•Proportionate sampling
•Optimum allocation stratified sampling
•Multi-stage sampling
Non Probability Sampling
In non-probability sampling, the population units can be selected at the discretion of the researcher. Such
samples rely on human judgement for selecting units and have no theoretical basis for estimating the
characteristics of the population. Some of the techniques used for non-probability sampling are:
•Quota sampling
•Judgement sampling
•Purposive sampling
31. Population and Sample Examples
•All the people who have ID proofs form the population, and the group of people who have
only a voter ID form the sample.
•All the students in a class are the population, whereas the top 10 students in the class are the
sample.
•All the members of parliament are the population, and the female members present there are
the sample.
Population and Sample Formulas
The formulas for mean absolute deviation (MAD), variance and standard deviation differ
between a population and a sample. With N denoting the population size and n the sample
size, the population formulas divide by N while the sample variance divides by n - 1:
Population: variance σ² = Σ(x - μ)² / N, standard deviation σ = √σ², MAD = Σ|x - μ| / N
Sample: variance s² = Σ(x - x̄)² / (n - 1), standard deviation s = √s², MAD = Σ|x - x̄| / n
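The divisor difference between the population and sample formulas can be seen with Python's standard library, which implements both conventions: pvariance and pstdev divide by the population size N, while variance and stdev divide by n - 1. The data below are illustrative.

```python
# Sketch: population vs. sample variance and standard deviation.
from statistics import pvariance, variance, pstdev, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean is 5, sum of squared deviations is 32

pop_var = pvariance(data)   # divides by N = 8      -> 32 / 8
samp_var = variance(data)   # divides by n - 1 = 7  -> 32 / 7
pop_sd = pstdev(data)       # square root of the population variance
samp_sd = stdev(data)       # square root of the sample variance
```

The sample variance is always slightly larger than the population variance on the same numbers, compensating for the fact that a sample tends to underestimate spread.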
32. Difference between Population and Sample
Some of the key differences between population and sample are given below:
• Meaning: a population is the collection of all the units or elements that possess
common characteristics; a sample is a subgroup of the members of the population.
• Includes: a population includes each and every element of a group; a sample includes
only a handful of units of the population.
• Characteristic: a measurable characteristic of a population is a parameter; that of a
sample is a statistic.
• Data collection: for a population, complete enumeration or census; for a sample,
sampling or a sample survey.
• Focus: a population study focuses on identification of the characteristics; a sample
study focuses on making inferences about the population.
33. Statistical modeling
1.Statistical Model:
Definition: A statistical model is a mathematical model that embodies a set of statistical
assumptions concerning the generation of sample data (and similar data from a larger population).
A statistical model is a combination of inferences based on collected data and population understanding,
used to predict information in an idealized form. This means that a statistical model can be an equation or a
visual representation of information based on research that has already been collected over time.
Statistical models are part of the foundation of statistical inference.
Essentially, all statistical models exist to describe relationships between variables, and because
there are different types of variables, there are different types of statistical models. Some of the types of
models include regression, analysis of variance, analysis of covariance, and chi-square tests.
34. 2.Statistical Modeling:
Statistical modeling is an approach to statistical data analysis that helps researchers
discover something about a phenomenon that is assumed to exist. This approach helps
explain the variability found in the dataset.
It is a strategy which brings together estimation and hypothesis testing under the same
umbrella.
This modeling approach constructs a summary model that displays current knowledge. The
model is then "fitted" to data.
A general modelling framework:
Data = Pattern + Residual
Where, Pattern: Systematic or ‘explained’ variation.
Residuals: Leftover or ‘Unexplained’ variation.
In simple terms, statistical modelling is a simplified, mathematically formalized way to
approximate reality (i.e. whatever generates your data) and, optionally, to make predictions
from this approximation.
35. Basic steps in statistical model building process are:
1. Model selection: In this step, plots of the data, process knowledge, and assumptions about
the process are used to determine the form of the model to be fitted to the data.
2. Model fitting: Using the selected model, and possibly further information about the data, an
appropriate model-fitting method is used to estimate the unknown parameters in the model.
When parameter estimates have been made, the model is carefully assessed to see whether
the underlying assumptions of the analysis appear plausible. If the assumptions seem valid, the
model can be used to answer the scientific questions that prompted the modeling effort.
3. Model validation: If model validation identifies problems with the current model,
the modeling process is repeated using information from the model validation.
36. Probability Distribution:
In Statistics, the probability distribution gives the possibility of each outcome of a
random experiment or event. It provides the probabilities of the different possible
occurrences.
To recall, probability is a measure of the uncertainty of various phenomena. For example, if
you throw a die, the probability distribution describes the chance of each possible outcome.
A distribution can be defined for any random experiment whose outcome is not certain
or cannot be predicted.
Probability Distribution Definition
Probability distribution yields the possible outcomes for any random event. It is also
defined based on the underlying sample space as a set of possible outcomes of any
random experiment. These settings could be a set of real numbers or a set of vectors or
set of any entities. It is a part of probability and statistics.
37. 1. Probability:
Probability means possibility. It is a branch of mathematics that deals with the occurrence of a
random event. The value is expressed from zero to one. Probability has been introduced in
Maths to predict how likely events are to happen.
The meaning of probability is basically the extent to which something is likely to happen. This
is the basic probability theory, which is also used in the probability distribution, where you will
learn the possibility of outcomes for a random experiment.
To find the probability of a single event to occur, first, we should know the total number of
possible outcomes.
2. Random experiments: A random experiment is an experiment whose outcome cannot
be predicted in advance.
For example, if we toss a coin, we cannot predict whether the outcome will be a head or a
tail. A possible result of a random experiment is called an outcome (or sample point), and
the set of all outcomes is called the sample space. With the help of these experiments or
events, we can always create a probability table in terms of values of the variable and
their probabilities.
Probability of event to happen P(E) = Number of favorable outcomes/Total
Number of outcomes
38. 3. Sample space: It is the set of all possible outcomes of a random experiment.
4. Random variables
A random variable is a variable whose possible values are numerical outcomes of a random experiment.
P(X) represents the probability distribution of X.
P(X = x) refers to the probability that the random variable X is equal to a particular value, denoted by x.
For example, P(X = 1) refers to the probability that the random variable X is equal to 1.
Consider an example: suppose you flip a coin two times. This simple statistical experiment has 4
possible outcomes: HH, HT, TH, TT. Now let a variable X represent the number of heads that result from
the experiment. The variable X can take the values 0, 1 or 2.
Table represent the probability distribution of a random variable X
Number of Heads Probability
0 0.25
1 0.50
2 0.25
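The two-flip table above can be reproduced by enumerating the sample space directly. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
# Sketch: distribution of X = number of heads in two coin flips,
# enumerated from the sample space {HH, HT, TH, TT}.
from itertools import product
from collections import Counter

sample_space = list(product("HT", repeat=2))   # [('H','H'), ('H','T'), ...]
counts = Counter(flips.count("H") for flips in sample_space)

# Each of the 4 outcomes is equally likely, so P(X = x) = count / 4.
distribution = {x: counts[x] / len(sample_space) for x in sorted(counts)}
```

Enumerating outcomes like this works for any small finite sample space and makes the link between outcomes and probabilities explicit.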
39. Probability Distribution:
A probability distribution is a function that describes the likelihood of obtaining the possible values
that a random variable can assume.
The probability distribution of a random variable X is defined as:
Definition: the probability distribution of a random variable X is the system of numbers
X : x1 x2 ……… xn
P(X) : p1 p2 ……… pn
where the real numbers x1, x2, …, xn are the possible values of the random variable X, and pi
is the probability of X taking the value xi, i.e. P(X = xi) = pi.
P(X) is the likelihood that the random variable takes a specific value x. The sum of the
probabilities over all possible values must equal 1.
A probability distribution may be either discrete or continuous.
A discrete distribution means that X can assume one of a countable number of values.
A continuous distribution means that X can assume any of an uncountable number of
values, e.g. any value in an interval.
A probability distribution is the function that describes the mapping from any realized value of the
random variable, to probability.
40. 1. Discrete probability distributions: Three frequently used discrete distributions are:
i) The Binomial distribution: used to compute probabilities for a process where only one of
two possible outcomes may occur on each trial.
Example: the number of sixes (0, 1, 2, …, 50) obtained when rolling a die 50 times. Here, the
random variable X is the number of "successes", that is, the number of times a six occurs. The
probability of getting a six on any one roll is 1/6.
ii) The Geometric distribution: used to determine the probability that a specified number of
trials will take place before the first success occurs.
Example: say the probability that an athlete achieves a distance of 6 m in the long jump is 0.7.
The geometric distribution gives the probability of the number of attempts the athlete will need
to achieve a 6 m jump: the probability of succeeding on the second attempt is 0.3 × 0.7 = 0.21,
and on the third attempt 0.3 × 0.3 × 0.7 = 0.063.
iii) The Poisson distribution: used to measure the probability that a given number of events
will occur during a given time frame.
Example: say the average number of buses arriving at a bus stop in a span of 30 minutes is 1.
The Poisson distribution can model the probability of different numbers of buses, X, arriving at
the stop within the next 30 minutes, where X can take the values 0, 1, 2, 3, 4, …
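The three distributions above can be evaluated from their probability mass functions with only the standard library. The helper names below are my own, and the long-jump numbers follow the example in the text.

```python
# Sketch: probability mass functions of the three discrete distributions.
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success prob p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def geometric_pmf(k, p):
    """P(the first success occurs on trial k)."""
    return (1 - p) ** (k - 1) * p

def poisson_pmf(k, lam):
    """P(exactly k events in an interval with mean event count lam)."""
    return lam**k * exp(-lam) / factorial(k)

# Long-jump example: success probability 0.7, first success on attempt 2 or 3.
second_try = geometric_pmf(2, 0.7)   # 0.3 * 0.7 = 0.21
third_try = geometric_pmf(3, 0.7)    # 0.3 * 0.3 * 0.7 = 0.063
```

As a sanity check, the binomial probabilities over k = 0…50 for the die-rolling example sum to 1, as every probability distribution must.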
41. 2. Continuous probability distribution:
i) Uniform distribution: In statistics, the uniform distribution is a type of
probability distribution in which all possible outcomes are equally likely. A deck of
cards has a uniform distribution over suits, since the probability of drawing a heart, club,
diamond or spade is the same.
ii)Normal Distribution: The normal distribution is the most important probability
distribution in statistics because it fits many natural phenomena.
For example, heights, blood pressure, measurement error, and IQ scores follow the
normal distribution. It is also known as the Gaussian distribution and the bell curve.
In a normal distribution, data is symmetrically distributed with no skew.
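The normal distribution can be explored with the standard library's NormalDist. The height parameters below (mean 170 cm, standard deviation 10 cm) are illustrative assumptions, not data from the text.

```python
# Sketch: properties of a normal distribution via statistics.NormalDist.
from statistics import NormalDist

heights = NormalDist(mu=170, sigma=10)   # assumed illustrative parameters

# Symmetry: exactly half the distribution lies below the mean.
below_mean = heights.cdf(170)

# Roughly 68% of values fall within one standard deviation of the mean
# (the first part of the 68-95-99.7 rule).
within_one_sd = heights.cdf(180) - heights.cdf(160)
```

The cdf calls make the "symmetric, no skew" claim concrete: probabilities to the left and right of the mean are equal, and mass concentrates near the center.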
42. Correlation
If the change in one variable appears to be accompanied by a change in other variable,
the two variables are said to be correlated and this inter-dependence is called correlation
or co-variation.
Correlation analysis is a method of statistical evaluation used to study the strength of the
relationship between two numerically measured, continuous variables (e.g. height and
weight). This type of analysis is useful when we want to establish whether there is a possible
connection between variables.
In short, the tendency of simultaneous variation between two variables is called
correlation or co-variation.
If correlation is found between two variables it means that when there is a systematic
change in one variable, there is also a systematic change in the other; the variables alter
together over a certain period of time.
If there is correlation found, depending upon the numerical values measured, this can
be either positive or negative.
The knowledge of correlation gives us an idea of the direction and intensity of change in
a variable when the correlated variable changes.
43. Correlation denotes interdependence among variables. Correlating two phenomena is most
meaningful when some underlying relationship between them could plausibly exist; if no such
relationship exists, an observed correlation between the two phenomena may be spurious.
If two variables vary in such a way that movements in one are accompanied by movements
in the other, the variables are said to be associated.
Causation always implies correlation, but correlation does not necessarily imply causation.
A strong positive or strong negative correlation between two variables does not mean that
one variable is caused by the other; a strong correlation by itself never implies a cause-effect
relationship between two variables.
Coefficient of correlation:
To measure the degree of association or relationship between two variables quantitatively,
an index of relationship is used, termed the coefficient of correlation.
The coefficient of correlation is a numerical index that tells us to what extent the two variables
are related, and to what extent the variations in one variable change with the variations in
the other. The coefficient of correlation is symbolized by r or ρ (rho) and ranges
from -1 to +1 (-1 <= r <= 1).
44. Techniques for Measuring Correlation:
Three important statistical tools used to measure correlation are: Scatter diagrams, Karl
Pearson's coefficient of correlation, and Spearman's rank correlation.
1. Scatter Diagram:
• A scatter diagram visually presents the nature of association without giving any specific
numerical value. In this technique, the values of the two variables are plotted as points on a
graph paper.
From a scatter diagram, one can get a fairly good idea of the nature of relationship. In a
scatter diagram the degree of closeness of the scatter points and their overall direction
enable us to examine the relationship.
If all the points lie on a line, the correlation is perfect and is said to be unity. If the scatter
points are widely dispersed around the line, the correlation is low.
The correlation is said to be linear if the scatter points lie near or on a line. Scatter
diagrams give us a visual idea of the relationship between two variables.
45. 2. Karl Pearson's Coefficient of Correlation:
A numerical measure of the linear relationship between two variables is given by Karl
Pearson's coefficient of correlation.
A relationship is said to be linear if it can be represented by a straight line. Pearson's
coefficient is also known as the product moment correlation or the simple correlation coefficient.
It gives a precise numerical value of the degree of linear relationship between two variables.
The linear relationship may be given by Y = a + bX.
This type of relation may be described by a straight line. The intercept that the line makes on
the Y axis is given by a, and the slope of the line is given by b; b gives the change in the value
of Y for a unit change in the value of X. On the other hand, if the relation cannot be
represented by a straight line, as in Y = X², the value of the coefficient will be zero. This clearly
shows that zero correlation need not mean the absence of any type of relation between the
two variables.
The value of the correlation coefficient lies between minus one and plus one: -1 <= r <= 1.
46. The product moment correlation, or Karl Pearson's measure of correlation, is given by
r = Σ(x - x̄)(y - ȳ) / √[ Σ(x - x̄)² · Σ(y - ȳ)² ]
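As a sketch, Pearson's r can be computed directly from its definition; the helper name and sample values below are illustrative.

```python
# Sketch: Karl Pearson's product moment correlation from first principles.
from math import sqrt

def pearson_r(xs, ys):
    """Correlation of paired samples: covariance over product of spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A perfectly linear relationship (y = 2x) gives r = 1.
r_perfect = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
```

The value always lies in [-1, 1]: the covariance in the numerator can never exceed the product of the spreads in the denominator.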
47. Correlation is of following types:
1. Positive correlation:
When the values of one variable increase as those of the other increase, the two variables
change in the same direction: high numerical values of one variable correspond to high
numerical values of the other, i.e. 0 < r < 1.
For example, height and weight, or study time and grades.
2. Negative correlation:
When the values of one variable decrease as those of the other increase, or vice versa, the
variables change in opposite directions: high numerical values of one variable correspond
to low numerical values of the other, i.e. -1 < r < 0.
For example, price and quantity demanded, or alcohol consumption and driving ability.
3. No correlation:
There is no impact on one variable from an increase or decrease in the values of the other.
If r = 0, the two variables are uncorrelated: there is no linear relation between them.
48. 4. Perfect positive correlation:
When a change in one variable is accompanied by an equally proportional change in the
other variable, say Y, in the same direction, the two variables are said to have a
perfect positive correlation, i.e. r = 1.
5. Perfect negative correlation:
Between two variables X and Y, if a change in X is accompanied by a proportional change
in Y of the same magnitude but in the opposite direction, the correlation is called a perfect
negative correlation, i.e. r = -1.
If there is correlation between two numerical sets of data, positive or negative, the
coefficient worked out can allow you to predict future trends between the two variables.
However, you must remember that you cannot be 100% sure that your prediction will be
correct because correlation does not determine cause or effect.
50. 3. Spearman's Rank Correlation:
Spearman's coefficient of correlation measures the linear association between ranks
assigned to individual items according to their attributes.
Attributes are those variables which cannot be measured numerically, such as the
intelligence of people, physical appearance, honesty, etc. Ranking may be a better
alternative to quantification of such qualities.
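A minimal sketch of Spearman's rank correlation, assuming no tied ranks: rank each series, then apply the standard formula rho = 1 - 6·Σd² / (n(n² - 1)) to the rank differences d. The helper names are illustrative.

```python
# Sketch: Spearman's rank correlation (no tied ranks assumed).
def ranks(values):
    """Assign rank 1 to the smallest value, 2 to the next, and so on."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(xs, ys):
    """1 - 6 * sum(d^2) / (n * (n^2 - 1)) over rank differences d."""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d_sq = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_sq / (n * (n**2 - 1))

# Two attributes that order five items identically give rho = 1.
rho_same = spearman_rho([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])
```

Because it works on ranks rather than raw values, Spearman's rho suits qualities like honesty or appearance, where items can be ordered but not measured.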
51. Regression:
Regression analysis is a statistical tool used for the investigation of relationships
between variables. It is a method of predicting or estimating one variable knowing the
value of the other variable.
Estimation is required in different fields in everyday life. A businessman wants to know
the effect of increase in advertising expenditure on sales or a doctor wishes to observe
the effect of a new drug on patients.
An economist is interested in finding the effect of change in demand pattern of some
commodities on prices. Usually, we seek to ascertain the causal effect of one variable
upon another.
We use a regression model to understand how changes in the predictor values are
associated with changes in the response mean. Regression analysis helps in
determining the cause and effect relationship between variables.
We can also use regression to make predictions based on the values of the predictors.
It plays a significant role in many human activities, as it is a powerful and flexible tool
used to forecast past, present or future events on the basis of past or present
data.
52. Regression analysis is also used to find trends in data. It will provide you with an equation
for a graph so that you can make predictions about your data.
For example, you might guess that there is a connection between how much you eat and
how much you weigh; regression analysis can help you to quantify that.
If you have been putting on weight over the last few years, it can predict how much you
will weigh in ten years time if you continue to put on weight at the same rate. It will also
give you a slew of statistics to tell you how accurate your model is.
Thus, regression analysis models the relationships between a response variable and one
or more predictor variables. In simple words, regression analysis is used to model the
relationship between a dependent variable and one or more independent variables.
Response variables are also known as dependent variables, Regressand, y-variables, and
outcome variables. Typically, you want to determine whether changes in the predictors are
associated with changes in the response.
Predictor variables are also known as independent variables, Regressor, x-variables, and
input variables. A predictor variable explains changes in the response. Typically, you want
to determine how changes in one or more predictors are associated with changes in the
response.
54. For example, in a plant growth study, the response variable is the amount of growth that
occurs during the study. The investigators want to determine how changes in the
predictors are associated with changes in plant growth. The predictors are the amount of
fertilizer applied, the soil moisture, and the amount of sunlight.
55. Definition:
“The statistical technique that expresses a functional relationship between two or
more variables in the form of an equation, to estimate the value of a variable,
based on the given value of another variable is called regression analysis".
The variable whose value is to be estimated is called dependent variable and the
variable whose value is used to estimate this value is called independent
variable.
The linear algebraic equations that express a dependent variable in terms of an
independent variable are called Linear Regression Equation.
In terms of statistical inference, regression analysis is concerned with the
parameters of the regression equation that obtains between two or more variables
in the population.
There are a variety of regression methodologies that you choose based on the
type of response variable, the type of model that is required to provide an
adequate fit to the data, and the estimation method.
56. The overall objectives of regression analysis can be summarized as follows:
1. To determine whether or not a relationship exists between two variables.
2. To describe the nature of the relationship, should one exist, in the form of a mathematical
equation.
3. To assess the degree of accuracy of description or prediction achieved by the regression
equation.
4. In the case of multiple regression, to assess the relative importance of the various predictor
variables in their contribution to variation in the criterion variable.
Types of Regression Models
57. The two basic types of regression analysis are:
1. Simple Regression Analysis:
It is used to estimate the relationship between a dependent variable and a single independent
variable. Regression models that involve one explanatory variable are called Simple Regression.
For example, the relationship between crop yields and rainfall.
2. Multiple Regression Analysis:
It is used to estimate the relationship between a dependent variable and two or more independent
Variables.
When two or more explanatory variables are involved, the relationships are called Multiple
Regressions.
For example, the relationship between the salaries of employees and their experience and education.
Multiple regression analysis introduces several additional complexities but may produce more realistic
results than simple regression analysis. Regression models are also divided into linear and nonlinear
models, depending on whether the relationship between the response and explanatory variables is
linear or nonlinear.
In a simple linear regression, there are two variables x and y, wherein y depends on x or say
influenced by x. Here y is called as dependent, or criterion variable and x is independent or predictor
variable.
59. The regression line of y on x is expressed as under:
y = a + bx
where a = constant (intercept) and b = regression coefficient (slope); a and b are the two
regression parameters. While there are a number of possible criteria for choosing a best-
fitting line, one of the most useful is the least squares criterion.
The slope b of the best-fitting line, based on the least squares criterion, can be shown to be
b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²
where the summation is over all n pairs of (xi, yi) values.
The value of a, the y-intercept, can in turn be shown to be a function of b, x̄ and ȳ, i.e.
a = ȳ - b x̄
60. In the following plot, we can observe the linear relationship between the mileage and
displacement of cars. The green points are the actual observations, while the black line is the
fitted regression line.
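As a numerical sketch of the least squares formulas above, the slope and intercept can be computed directly. The x and y values here are made-up sample data, not the mileage data from the plot:

```python
# Least squares fit of the simple linear regression line y = a + b*x.
# The x and y values are illustrative sample data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_bar, y_bar = x.mean(), y.mean()

# Slope: b = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# Intercept: a = y_bar - b * x_bar
a = y_bar - b * x_bar

print(f"a = {a:.3f}, b = {b:.3f}")
```

For this sample, the slope comes out near 1.99 and the intercept near 0.05, so the fitted line is approximately y = 0.05 + 1.99x.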
61. Steps in Regression Analysis:
Regression analysis includes the following steps:
Step 1: Statement of the Problem under Consideration:
The first important step in conducting any regression analysis is to specify the problem
and the objectives to be addressed by the regression analysis.
A wrong formulation or a wrong understanding of the problem will lead to wrong statistical
inferences. The choice of variables depends upon the objectives of the study and the
understanding of the problem.
Step 2: Choice of Relevant Variables:
Once the problem is carefully formulated and objectives have been decided, the next
question is to choose the relevant variables.
It has to be kept in mind that the correct choice of variables determines whether the statistical
inferences are correct.
For example, in any agricultural experiment, the yield depends on explanatory variables
like quantity of fertilizer, rainfall, irrigation, temperature etc. These variables are denoted by
X1, X2, ..., Xk as a set of k explanatory variables.
62. Step 3: Collection of Data on Relevant Variables:
Once the objective of the study is clearly stated and the variables are chosen, the next
task is to collect data on these relevant variables. The data is essentially the
measurement of these variables.
For example, suppose we want to collect data on age. For this, it is important to know
how to record it. Either the date of birth can be recorded, which will provide the exact
age on any specific date, or the age can be recorded in terms of completed years as on a
specific date.
Moreover, it is also important to decide whether the data has to be collected as
quantitative variables or qualitative variables.
Examples of quantitative variables include height and weight, while examples of qualitative
variables include hair color, religion and gender. Quantitative variables are often
represented in units of measurement, and qualitative variables are represented in non-
numerical terms.
63. Step 4: Specification of Model:
The experimenter or the person working in the subject usually helps in determining the
form of the model. Only the form of the tentative model can be ascertained and it will
depend on some unknown parameters. For example, a general form will be like
y = f(X1, X2, ..., Xk; β1, β2, ..., βk) + ε
where ε is the random error, reflecting mainly the difference between the observed value of y and
the value of y obtained through the model. The form of f(X1, X2, ..., Xk; β1, β2, ..., βk)
can be linear as well as nonlinear, depending on the form of the parameters (β1, β2, ..., βk). A
model is said to be linear if it is linear in its parameters.
For example,
y = β1X1 + β2X2 + β3X3 + ε
y = β1 + β2 ln X1 + ε, are linear models, whereas
y = β1X1 + β2²X2 + β3X3 + ε
y = (ln β1)X1 + β2X2 + ε, are nonlinear models, since they are not linear in the parameters.
64. Step 5: Choice of Method for Fitting the Data:
After the model has been defined and the data have been collected, the next task
is to estimate the parameters of the model based on the collected data. This is
also referred to as parameter estimation or model fitting.
The estimated parameters (also called coefficients) give the change in the response
associated with a one-unit change in a predictor, all other predictors being held
constant.
The most commonly used method of estimation is the least squares method.
Under certain assumptions, the least squares method produces estimators with
desirable properties. The other estimation methods are the maximum likelihood
method, ridge method, principal components method etc.
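As a sketch of least squares parameter estimation, a multiple regression model can be fitted with numpy's `lstsq` solver. The data below is synthetic, generated from known coefficients so the recovered estimates can be checked against the truth:

```python
# Least squares estimation for y = b0 + b1*X1 + b2*X2 + error.
# The data is synthetic: generated from known coefficients (2.0, 3.0, -1.5)
# plus small random noise, purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.uniform(0, 10, 50)
X2 = rng.uniform(0, 5, 50)
y = 2.0 + 3.0 * X1 - 1.5 * X2 + rng.normal(0, 0.1, 50)

# Design matrix: a column of ones for the intercept, then the predictors.
A = np.column_stack([np.ones_like(X1), X1, X2])

# Solve the least squares problem min ||A @ beta - y||^2.
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta)  # estimates close to the true coefficients [2.0, 3.0, -1.5]
```

With only mild noise, the least squares estimates land close to the coefficients used to generate the data, which is the "desirable properties" claim in action.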
65. Step 6: Fitting of Model:
The estimation of unknown parameters using appropriate method provides the values of
the parameters. Substituting these values in the equation gives us a usable model. This is
termed as model fitting.
The estimates of the parameters β1, ..., βk in the model
y = f(X1, X2, ..., Xk; β1, β2, ..., βk) + ε
are denoted as β̂1, β̂2, ..., β̂k, which gives the fitted model as
ŷ = f(X1, X2, ..., Xk; β̂1, β̂2, ..., β̂k)
When the value of y is obtained for given values of X1, X2, ..., Xk, it is denoted ŷ
and called the fitted value.
The fitted equation is also used for prediction; in that case, ŷ is termed the predicted value.
Note that a fitted value is one where the values used for the explanatory variables
correspond to one of the n observations in the data, whereas a predicted value is the
one obtained for any set of values of the explanatory variables. It is not generally
recommended to predict y-values for values of the explanatory variables
which lie outside the range of the data. When the values of the explanatory variables are
future values, the predicted values are called forecasted values.
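The fitted-versus-predicted distinction can be illustrated in a few lines. The coefficients and observations below are hypothetical:

```python
# Fitted vs predicted values for a fitted line y-hat = a + b*x.
# The coefficients a, b and the observed x values are hypothetical.
a, b = 0.05, 1.99
x_obs = [1.0, 2.0, 3.0, 4.0, 5.0]  # explanatory values present in the data

# Fitted values: y-hat evaluated at the observed x values.
fitted = [a + b * xi for xi in x_obs]

# Predicted value: y-hat evaluated at a new x (here, within the data range).
x_new = 3.5
predicted = a + b * x_new

print(fitted, predicted)
```

Both use the same equation; only the origin of the x values differs. If x_new were a future value of the explanatory variable, the result would be called a forecasted value.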
66. Step 7: Model Validation and Criticism:
The validity of statistical methods to be used for regression analysis depends on various
assumptions. These assumptions are essentially the assumptions for the model and the
data.
The quality of the statistical inferences heavily depends on whether these assumptions are
satisfied or not. To ensure that these assumptions are valid and satisfied, care is
needed from the beginning of the experiment.
One has to be careful in choosing the required assumptions and in examining whether they
are valid for the given experimental conditions. It is also important to identify the
situations in which the assumptions may not be met.
The validation of the assumptions must be made before drawing any statistical conclusion.
Any departure from validity of assumptions will be reflected in the statistical inferences. In
fact, the regression analysis is an iterative process where the outputs are used to
diagnose, validate, criticize and modify the inputs.
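One basic diagnostic in this iterative process can be sketched as follows: after a least squares fit with an intercept, the residuals should average to zero and show no correlation with the predictor. The data here is synthetic, for illustration only:

```python
# Residual diagnostics after a simple least squares fit (synthetic data).
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 100)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, 100)

# Least squares slope and intercept.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
residuals = y - (a + b * x)

# With an intercept in the model, least squares forces the residual mean to
# zero and makes the residuals uncorrelated with x; patterns remaining in a
# residual plot would signal a violated assumption.
print(abs(residuals.mean()))
print(np.corrcoef(x, residuals)[0, 1])
```

In practice one would also plot the residuals against the fitted values and check them for normality before trusting the model's inferences.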
67. Step 8: Using the Chosen Model(s) for the Solution of the posed problem and
forecasting:
The determination of the explicit form of the regression equation is the ultimate objective of
regression analysis; the result is a good and valid relationship between the study variable and
the explanatory variables.
The regression equation helps in understanding the interrelationships among the variables.
Such a regression equation can be used for several purposes.
For example, to determine the role of any explanatory variable in the joint relationship in
any policy formulation, to forecast the values of response variable for given set of values of
explanatory variables.
69. Applications or Uses of Regression Analysis:
1. Predictive Analytics:
Predictive analytics, i.e. forecasting future opportunities and risks, is the most prominent
application of regression analysis in business. Demand analysis, for instance, predicts the
number of items a consumer will probably purchase.
• However, demand is not the only dependent variable when it comes to business.
Regression analysis can go far beyond forecasting impact on direct revenue.
• For example, insurance companies heavily rely on regression analysis to estimate the
credit standing of policyholders and the possible number of claims in a given time period.
2. Operational Efficiency:
• Regression models can also be used to optimize business processes. A factory manager,
for example, can create a statistical model to understand the impact of oven temperature
on the shelf life of the cookies baked in those ovens.
• In a call center, we can analyze the relationship between the wait times of callers and the
number of complaints. Data-driven decision making eliminates guesswork, hypotheses
and corporate politics from decision making.
• This improves business performance by highlighting the areas that have the
maximum impact on operational efficiency and revenues.
70. 3. Supporting Decisions:
Today businesses are overloaded with data on finances, operations and customer purchases.
Increasingly, executives are now leaning on data analytics to make informed business
decisions.
Regression analysis can bring a scientific angle to the management of any business. By
reducing the tremendous amount of raw data into actionable information, regression analysis
leads the way to smarter and more accurate decisions. This technique acts as a perfect tool
to test a hypothesis before diving into execution.
4. Correcting Errors:
Regression is not only great for lending empirical support to management decisions but also for
identifying errors in judgment.
For example, a retail store manager may believe that extending shopping hours will greatly
increase sales. Regression analysis, however, may indicate that the increase in revenue might
not be sufficient to support the rise in operating expenses due to longer working hours (such as
additional employee labor charges).
Hence, regression analysis can provide quantitative support for decisions and prevent mistakes
due to managers' intuition.
71. 5. New Insights:
• Over time businesses have gathered a large volume of unorganized data that has the
potential to yield valuable insights. However, this data is useless without proper analysis.
• Regression analysis techniques can find a relationship between different variables by
uncovering patterns that were previously unnoticed.
• For example, analysis of data from point-of-sale systems and purchase accounts may
highlight market patterns like an increase in demand on certain days of the week or at certain
times of the year. By acting on these insights, you can maintain optimal stock and personnel
levels before a spike in demand arises.
72. Correlation vs. Regression:
1. Meaning: Correlation is a statistical measure which determines the co-relationship or
association of two variables, whereas regression describes how an independent variable is
numerically related to the dependent variable.
2. Usage: Correlation is used to represent the linear relationship between two variables,
whereas regression is used to fit a best line and estimate one variable on the basis of
another.
3. Dependent and independent variables: In correlation there is no distinction between the
two variables; in regression the two variables play different roles.
4. Indicates: The correlation coefficient indicates the extent to which two variables move
together, whereas regression indicates the impact of a unit change in the known variable (x)
on the estimated variable (y).
5. Objective: Correlation aims to find a numerical value expressing the relationship between
the variables; regression aims to estimate values of a random variable on the basis of the
values of a fixed variable.
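The link between the two measures can be sketched numerically: the regression slope of y on x equals r multiplied by the ratio of the standard deviations, so the correlation coefficient is symmetric while the slope depends on which variable is treated as dependent. The sample data below is illustrative:

```python
# Correlation coefficient vs regression slope on the same sample data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

r = np.corrcoef(x, y)[0, 1]  # symmetric: corr(x, y) == corr(y, x)

# Slope of y on x, and slope of x on y -- these differ.
b_yx = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b_xy = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((y - y.mean()) ** 2)

# Identities: b_yx = r * sd(y)/sd(x), and b_yx * b_xy = r**2.
print(r, b_yx, b_xy)
```

These identities make the table concrete: r is a unit-free measure of co-movement, while each slope carries units (change in the estimated variable per unit change in the known variable) and changes when the roles of x and y are swapped.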