Unit 5 Correlation

Unit-5
Correlation :-
Suppose we have a set of 30 students in a class and we want to measure the heights and weights of all the students. We observe that each individual (unit) of the set assumes two values, one relating to the height and the other to the weight. Such a distribution, in which each individual or unit of the set is made up of two values, is called a bivariate distribution. Some examples of bivariate distributions are:
(i) In a class of 60 students, the series of marks obtained in two subjects by all of them.
(ii) The series of sales revenue and advertising expenditure of two companies in a particular year.
(iii) The series of ages of husbands and wives in a sample of selected married couples.
Thus in a bivariate distribution we are given a set of pairs of observations, wherein each pair represents the values of two variables.
In a bivariate distribution, we are interested in finding a relationship (if it exists) between the two variables under study. The concept of 'correlation' is a statistical tool which studies the relationship between two variables, and Correlation Analysis involves various methods and techniques used for studying and measuring the extent of the relationship between the two variables.
Definition:- Two variables are said to be in correlation if a change in one of the variables results in a change in the other variable.
Types of Correlation:-
The various types of correlation are positive, negative, no correlation, perfect, strong and weak correlation.
Positive Correlation
Positive correlation occurs when an increase in one variable is accompanied by an increase in the value of the other.
The line corresponding to the scatter plot is an increasing line.
Negative Correlation
Negative correlation occurs when an increase in one variable is accompanied by a decrease in the value of the other.
The line corresponding to the scatter plot is a decreasing line.

No Correlation
No correlation occurs when there is no linear dependency between the variables.
Perfect Correlation
Perfect correlation occurs when there is a functional dependency between the variables.
In this case all the points lie on a straight line.
Strong Correlation
A correlation is stronger the closer the points lie to the line.
Weak Correlation
A correlation is weaker the farther the points are scattered from the line.
Some examples of series with positive correlation are:
(i) Heights and weights;
(ii) Household income and expenditure;
(iii) Price and supply of commodities;
(iv) Amount of rainfall and yield of crops.
Correlation between two variables is said to be negative or inverse if the variables deviate in opposite directions; that is, if an increase (or decrease) in the values of one variable results, on average, in a corresponding decrease (or increase) in the values of the other variable.
Some examples of series with negative correlation are:
(i) Volume and pressure of a perfect gas;
(ii) Current and resistance [keeping the voltage constant] ($R = V/I$);
(iii) Price and demand of goods.

Note:
(i) If the points are very close to each other, a fairly good amount of correlation can be expected between the two variables. On the other hand, if they are widely scattered, a poor correlation can be expected between them.
(ii) If the points are scattered and they reveal no upward or downward trend, as in the case of (d), then we say the variables are uncorrelated.
(iv) If there is an upward trend rising from the lower left-hand corner and going upward to the upper right-hand corner, the correlation obtained from the graph is said to be positive. Also, if there is a downward trend from the upper left-hand corner, the correlation obtained is said to be negative.
(v) The graphs shown above are generally termed scatter diagrams.
The Coefficient of Correlation (Karl Pearson's method)
Karl Pearson's method is popularly known as Pearson's coefficient of correlation.
One of the most widely used statistics is the coefficient of correlation $r$, which measures the degree of association between the two values of related variables given in the data set. The coefficient of correlation $r$ is given by the formula
$$ r = \frac{\sum XY}{n\,\sigma_x \sigma_y} = \frac{\sum XY}{\sqrt{\sum X^2 \sum Y^2}} \qquad \left[\because\ \sigma_x^2 = \frac{\sum X^2}{n};\ \sigma_y^2 = \frac{\sum Y^2}{n}\right] $$
Here $X = (x - \bar{x})$; $Y = (y - \bar{y})$
$\sigma_x$ = Standard deviation of series $x$
$\sigma_y$ = Standard deviation of series $y$
$n$ = Number of pairs of observations
$r$ = The (product moment) correlation coefficient
This method is to be applied only where deviations of items are taken from the actual mean and not from the assumed mean.
The value of the coefficient of correlation $r$ obtained from the above formula always lies between $-1$ and $+1$. When $r = +1$ there is perfect positive correlation between the variables. When $r = -1$ there is perfect negative correlation between the variables. However, if $r = 0$ there is no linear relationship between the variables.
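The deviation form of the formula can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function name `pearson_r` and the two small data sets are assumptions made up for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the actual means."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 pairs")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    X = [x - x_bar for x in xs]  # X = x - mean of x
    Y = [y - y_bar for y in ys]  # Y = y - mean of y
    sum_XY = sum(a * b for a, b in zip(X, Y))
    sum_X2 = sum(a * a for a in X)
    sum_Y2 = sum(b * b for b in Y)
    if sum_X2 == 0 or sum_Y2 == 0:
        raise ValueError("r is undefined when one series is constant")
    return sum_XY / math.sqrt(sum_X2 * sum_Y2)


# Hypothetical data: points lying exactly on a rising and on a falling line.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(r_up, r_down)
```

The two illustrative series lie exactly on straight lines, so they give the limiting values of perfect positive and perfect negative correlation.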
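Direct method:-
Substituting the values of $\sigma_x$ and $\sigma_y$ in the above formula, we get
$$ r = \frac{\sum XY}{\sqrt{\sum X^2 \sum Y^2}}, $$
or, in terms of the original values $x$ and $y$,
$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} $$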
Example:- Making use of the data summarized below, calculate the coefficient of correlation.
Case   A    B    C    D    E    F    G    H
x      10   9    6    10   12   13   11   9
y      9    4    6    9    11   13   8    4
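Solution:-
Case   x    X = x - 10   X^2   y    Y = y - 8   Y^2   XY
A      10    0            0    9    +1           1     0
B      9    -1            1    4    -4          16     4
C      6    -4           16    6    -2           4     8
D      10    0            0    9    +1           1     0
E      12   +2            4    11   +3           9     6
F      13   +3            9    13   +5          25    15
G      11   +1            1    8     0           0     0
H      9    -1            1    4    -4          16     4
$n = 8$, $\sum x = 80$, $\sum X = 0$, $\sum X^2 = 32$, $\sum y = 64$, $\sum Y = 0$, $\sum Y^2 = 72$, $\sum XY = 37$
$$ \bar{x} = \frac{\sum x}{n} = \frac{80}{8} = 10, \qquad \bar{y} = \frac{\sum y}{n} = \frac{64}{8} = 8 $$
$$ r = \frac{\sum XY}{\sqrt{\sum X^2 \sum Y^2}} = \frac{37}{\sqrt{32 \times 72}} = \frac{37}{\sqrt{2304}} = \frac{37}{48} = +0.771 $$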
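As a cross-check, the sketch below recomputes $r$ from the eight (x, y) pairs listed in the example, once with the direct formula on the raw values and once with deviations from the means. The variable names are illustrative assumptions; the two forms should agree with each other.

```python
import math

# Data from the example (cases A to H).
x = [10, 9, 6, 10, 12, 13, 11, 9]
y = [9, 4, 6, 9, 11, 13, 8, 4]
n = len(x)

# Direct method: uses only sums of the raw values.
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)
r_direct = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

# Deviation method: X = x - mean(x), Y = y - mean(y).
x_bar, y_bar = sum_x / n, sum_y / n
X = [a - x_bar for a in x]
Y = [b - y_bar for b in y]
sum_XY = sum(a * b for a, b in zip(X, Y))
sum_X2 = sum(a * a for a in X)
sum_Y2 = sum(b * b for b in Y)
r_deviation = sum_XY / math.sqrt(sum_X2 * sum_Y2)

print(sum_X2, sum_Y2, sum_XY)
print(round(r_direct, 3), round(r_deviation, 3))
```

Printing the intermediate sums makes it easy to compare each column total of the working table with the computed value.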
Regression
If two variables are significantly correlated, and if there is some theoretical basis for doing so, it is possible to predict (estimate) values of one variable from the other. This observation leads to a very important concept known as 'Regression Analysis'.
For example, if we know that advertising and sales are correlated, we can find the expected amount of sales for a given advertising expenditure, or the advertising expenditure required for attaining a given amount of sales. Similarly, if we know that the yield of rice and rainfall are closely related, we may find the amount of rain required to achieve a certain production figure.
In general, regression analysis means the estimation or prediction of the unknown value of one variable from the known value of the other variable. It is one of the most important statistical tools and is extensively used in almost all sciences: natural, social and physical. It is especially used in business and economics to study the relationship between two or more variables that are related causally, and for the estimation of demand and supply graphs, cost functions, production and consumption functions and so on.
Prediction or estimation is one of the major problems in almost all spheres of human activity. The estimation or prediction of future production, consumption, prices, investments, sales, profits, income etc. is of very great importance to business professionals. Similarly, population estimates and population projections, GNP, revenue and expenditure etc. are indispensable for economists and for efficient planning of an economy.
The dictionary meaning of 'regression' is returning or going back. The term 'regression' was first used by Sir Francis Galton (1822-1911) in 1877 while studying the relationship between the heights of fathers and sons. The term was introduced by him in the paper "Regression towards Mediocrity in Hereditary Stature". Regression analysis was explained by M.M. Blair as follows:
"Regression analysis is a mathematical measure of the average relationship between two or more variables in terms of the original units of the data".
Line of Regression
If the dots of the scatter diagram generally tend to cluster along a well-defined direction which suggests a linear relationship between the variables x and y, the line of best fit for the given distribution of dots is called the 'line of regression'.
There are two such lines, one giving the best possible mean values of y for each specified value of x and the other giving the best possible mean values of x for given values of y. The former is called the line of regression of y on x and the latter is called the line of regression of x on y.
First consider the line of regression of y on x.
Let the straight line satisfying the general trend of the n dots in a scatter diagram be
𝑦 = 𝑎 + 𝑏𝑥 ⋯(𝑖)
We have to determine the constants a and b so that equation (i) gives, for each value of x, the best estimate of the average value of y. Thus, by the principle of least squares, the normal equations for a and b are
∑𝑦 = 𝑛𝑎 + 𝑏∑𝑥 ⋯(𝑖𝑖)
∑𝑥𝑦 = 𝑎∑𝑥 + 𝑏∑𝑥2 ⋯(𝑖𝑖𝑖)
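The normal equations follow from the principle of least squares. A brief sketch of that step, assuming the 'best' line is the one that minimizes the sum of squared vertical deviations of the dots from the line (the symbol $S$ is introduced here only for illustration):

```latex
\begin{align*}
S &= \sum (y - a - bx)^2 \\
\frac{\partial S}{\partial a} &= -2\sum (y - a - bx) = 0
  \;\Longrightarrow\; \sum y = na + b\sum x \\
\frac{\partial S}{\partial b} &= -2\sum x\,(y - a - bx) = 0
  \;\Longrightarrow\; \sum xy = a\sum x + b\sum x^2
\end{align*}
```

Setting both partial derivatives to zero gives exactly equations (ii) and (iii).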
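Equation (ii) gives
$$ \frac{1}{n}\sum y = a + b \cdot \frac{1}{n}\sum x, \quad \text{i.e.} \quad \bar{y} = a + b\bar{x} $$
This shows that $(\bar{x}, \bar{y})$, i.e. the point of means of x and y, lies on (i).
Shifting the origin to $(\bar{x}, \bar{y})$, equation (iii) takes the form
$$ \sum (x - \bar{x})(y - \bar{y}) = a\sum (x - \bar{x}) + b\sum (x - \bar{x})^2 $$
But $\sum (x - \bar{x}) = 0$,
$$ \therefore\ b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{\sum XY}{\sum X^2} = \frac{\sum XY}{n\sigma_x^2} = r\frac{\sigma_y}{\sigma_x} \qquad \left(\because\ r = \frac{\sum XY}{n\sigma_x \sigma_y}\right) $$
Thus the line of best fit becomes
$$ (y - \bar{y}) = r\frac{\sigma_y}{\sigma_x}(x - \bar{x}) $$
which is the equation of the line of regression of y on x. Its slope is called the regression coefficient of y on x.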
Interchanging x and y, the line of regression of x on y is
$$ (x - \bar{x}) = r\frac{\sigma_x}{\sigma_y}(y - \bar{y}) $$
Thus the regression coefficient of y on x is $r\dfrac{\sigma_y}{\sigma_x}$ and the regression coefficient of x on y is $r\dfrac{\sigma_x}{\sigma_y}$.
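Both regression coefficients can be computed from the same deviation sums, since $r\sigma_y/\sigma_x = \sum XY / \sum X^2$ and $r\sigma_x/\sigma_y = \sum XY / \sum Y^2$. The following is a minimal sketch under that identity; the function name `regression_coefficients` and the five data pairs are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def regression_coefficients(xs, ys):
    """Return (x_bar, y_bar, b_yx, b_xy) using deviations from the means."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 pairs")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_XY = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_X2 = sum((x - x_bar) ** 2 for x in xs)
    sum_Y2 = sum((y - y_bar) ** 2 for y in ys)
    if sum_X2 == 0 or sum_Y2 == 0:
        raise ValueError("a constant series has no regression coefficient")
    b_yx = sum_XY / sum_X2  # slope of the line of regression of y on x
    b_xy = sum_XY / sum_Y2  # slope of the line of regression of x on y
    return x_bar, y_bar, b_yx, b_xy


# Hypothetical data, not taken from the text.
x_bar, y_bar, b_yx, b_xy = regression_coefficients([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])

# Estimate y for x = 6 from the line (y - y_bar) = b_yx * (x - x_bar).
y_at_6 = y_bar + b_yx * (6 - x_bar)
print(b_yx, b_xy, y_at_6)
```

Each line is used only in its own direction: the y-on-x line to estimate y from x, and the x-on-y line to estimate x from y.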
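Corollary:-
The correlation coefficient is the geometric mean of the two regression coefficients, since
$$ r\frac{\sigma_y}{\sigma_x} \times r\frac{\sigma_x}{\sigma_y} = r^2. $$
The sign of $r$ is the same as the common sign of the two regression coefficients.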
Example:-
From the following data obtain the two regression equations, and also calculate the regression equations taking deviations of items from the means of the x and y series.
x    6    2    10   4    8
y    9    11   5    8    7
Solution:-
OBTAINING REGRESSION EQUATIONS
x     y     xy    x^2   y^2
6     9     54    36    81
2     11    22    4     121
10    5     50    100   25
4     8     32    16    64
8     7     56    64    49
$\sum x = 30$, $\sum y = 40$, $\sum xy = 214$, $\sum x^2 = 220$, $\sum y^2 = 340$
Regression equation of y on x: $y = a + bx$
$\sum y = na + b\sum x$
$\sum xy = a\sum x + b\sum x^2$
Substituting the values,
$40 = 5a + 30b$ ⋯(i)
$214 = 30a + 220b$ ⋯(ii)
Multiplying equation (i) by 6: $240 = 30a + 180b$ ⋯(iii)
$214 = 30a + 220b$ ⋯(iv)
Subtracting equation (iv) from (iii): $-40b = 26$, or $b = -0.65$
Substituting the value of b in equation (i):
$40 = 5a + 30(-0.65)$, or $5a = 40 + 19.5 = 59.5$, or $a = 11.9$
Putting the values of a and b in the equation, the regression line of y on x is $y = 11.9 - 0.65x$.
Regression equation of x on y: $x = a + by$
$\sum x = na + b\sum y$
$\sum xy = a\sum y + b\sum y^2$
Substituting the values,
$30 = 5a + 40b$ ⋯(i)
$214 = 40a + 340b$ ⋯(ii)
Multiplying equation (i) by 8: $240 = 40a + 320b$ ⋯(iii)
$214 = 40a + 340b$ ⋯(iv)
Subtracting equation (iv) from (iii): $-20b = 26$, or $b = -1.3$
Substituting the value of b in equation (i):
$30 = 5a + 40(-1.3)$, or $5a = 30 + 52 = 82$, or $a = 16.4$
Putting the values of a and b in the equation, the regression line of x on y is $x = 16.4 - 1.3y$.
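The elimination steps above amount to solving two linear equations in a and b. A minimal sketch using Cramer's rule is shown below; the helper name `solve_normal_equations` and its parameter names are assumptions for illustration, with u standing for the predictor series and v for the series being estimated.

```python
def solve_normal_equations(n, sum_u, sum_v, sum_uv, sum_u2):
    """Solve  sum_v  = n*a     + b*sum_u
              sum_uv = a*sum_u + b*sum_u2   for (a, b) by Cramer's rule."""
    det = n * sum_u2 - sum_u ** 2
    if det == 0:
        raise ValueError("normal equations are singular (constant predictor)")
    a = (sum_v * sum_u2 - sum_u * sum_uv) / det
    b = (n * sum_uv - sum_u * sum_v) / det
    return a, b


# Sums taken from the table in the solution.
n, sum_x, sum_y, sum_xy, sum_x2, sum_y2 = 5, 30, 40, 214, 220, 340

# y on x: predictor is x, estimated series is y.
a_yx, b_yx = solve_normal_equations(n, sum_x, sum_y, sum_xy, sum_x2)
# x on y: predictor is y, estimated series is x.
a_xy, b_xy = solve_normal_equations(n, sum_y, sum_x, sum_xy, sum_y2)

print(round(a_yx, 2), round(b_yx, 2))  # 11.9 -0.65
print(round(a_xy, 2), round(b_xy, 2))  # 16.4 -1.3
```

The same helper serves both lines because interchanging the roles of x and y only swaps which sums are passed in.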
CALCULATION OF REGRESSION EQUATIONS
x     X = x - 6   X^2   y     Y = y - 8   Y^2   XY
6      0           0    9     +1           1      0
2     -4          16    11    +3           9    -12
10    +4          16    5     -3           9    -12
4     -2           4    8      0           0      0
8     +2           4    7     -1           1     -2
$\sum x = 30$, $\sum X = 0$, $\sum X^2 = 40$, $\sum y = 40$, $\sum Y = 0$, $\sum Y^2 = 20$, $\sum XY = -26$
$$ \bar{x} = \frac{30}{5} = 6; \qquad \bar{y} = \frac{40}{5} = 8 $$
The line of regression of x on y is
$$ (x - \bar{x}) = r\frac{\sigma_x}{\sigma_y}(y - \bar{y}) $$
$$ r\frac{\sigma_x}{\sigma_y} = \frac{\sum XY}{\sum Y^2} = \frac{-26}{20} = -1.3 $$
$$ x - 6 = -1.3(y - 8) = -1.3y + 10.4 $$
$$ x = -1.3y + 10.4 + 6 = 16.4 - 1.3y $$
The line of regression of y on x is
$$ (y - \bar{y}) = r\frac{\sigma_y}{\sigma_x}(x - \bar{x}) $$
$$ r\frac{\sigma_y}{\sigma_x} = \frac{\sum XY}{\sum X^2} = \frac{-26}{40} = -0.65 $$
$$ y - 8 = -0.65(x - 6) = -0.65x + 3.9 $$
$$ y = -0.65x + 3.9 + 8 = 11.9 - 0.65x $$
Thus we find the same answer as obtained earlier. However, the calculations are very much simplified without the use of the normal equations.
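The corollary on the geometric mean can be checked against this example. The short calculation below uses only the two regression coefficients and the deviation sums already obtained; the symbols $b_{yx}$ and $b_{xy}$ for the two regression coefficients are shorthand introduced here.

```latex
\begin{align*}
r^2 &= b_{yx} \times b_{xy} = (-0.65)(-1.3) = 0.845 \\
r   &= -\sqrt{0.845} \approx -0.919
      \quad \text{(negative, since both coefficients are negative)} \\
\text{Check: } r &= \frac{\sum XY}{\sqrt{\sum X^2 \sum Y^2}}
      = \frac{-26}{\sqrt{40 \times 20}} = \frac{-26}{\sqrt{800}} \approx -0.919
\end{align*}
```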
Experiment:-
An experiment is a treatment applied to a group of objects or subjects in the interest of observing the response.
Treatment:-
In experiments, a treatment is something that researchers administer to experimental units.
For example, a corn field is divided into four, and each part is 'treated' with a different fertilizer to see which produces the most corn; a teacher practices different teaching methods on different groups in her class to see which yields the best results; a doctor treats a patient with a skin condition with different creams to see which is most effective. Treatments are administered to experimental units by 'level', where level implies amount or magnitude. For example, if the experimental units were given 5mg, 10mg and 15mg of a medication, those amounts would be three levels of the treatment.
(Definition taken from Valerie J. Easton and John H. McColl's Statistics Glossary v1.1)
Factor:-
A factor of an experiment is a controlled independent variable; a variable whose levels are set by the experimenter.
A factor is a general type or category of treatments. Different treatments constitute different levels of a factor.
For example, three different groups of runners are subjected to different training methods. The runners are the experimental units and the training methods are the treatments, where the three types of training methods constitute three levels of the factor 'type of training'.
(Definition taken from Valerie J. Easton and John H. McColl's Statistics Glossary v1.1)
Experimental Design
We are concerned with the analysis of data generated from an experiment. It takes time to organize the experiment properly so that the right type of data, and enough of it, is available to answer the questions of interest as clearly and efficiently as possible. This process is called experimental design.
There are six concepts of experimental design:
(i) Independent Variable
(ii) Dependent Variable
(iii) Constant
(iv) Control group
(v) Experimental Group
(vi) Repeated trials
Variable:- A variable is anything that changes during the experiment.
Independent Variable:- The independent variable is the one that is changed on purpose by the experimenter. It is also known as the cause, stimulus, reason or manipulated variable. It is the "if" part of the hypothesis.
Dependent Variable:- The variable that responds to the independent variable is called the dependent variable. It is also known as the effect, result or responding variable. It is the "then" part of the hypothesis.
Constant:- All factors which are not allowed to change during the experiment are called constants.
Control Group:- The control group is the group or the standard to which everything is compared.
Experimental Group:- The experimental group is the group which is tested with the independent variable. Each test group has only one factor different from the other group, namely the independent variable.
Repeated trials:- Repeated trials is the number of times the experiment is repeated. The more times we repeat the experiment, the more valid our result will be.

The IVCDV (Independent Variable, Constant, Dependent Variable) chart is used to design the experiment.
IV (Fertilizer)    Constant            DV
0 drops            Amount of water     Plant growth
2 drops            Type of soil
4 drops            Amount of soil
6 drops            Type of plant
                   Type of planter
                   Size of planter
                   Type of light
                   Location
A variable is what changes during the experiment. Here the number of drops of fertilizer (0, 2, 4 or 6) is varied by the experimenter, so it is the independent variable. Plant growth depends on the drops of fertilizer, so it is the dependent variable. The others are constants.
Amount of water, type of soil, amount of soil, type of plant, type of planter, size of planter, type of light and location are constants.
If we want to test the soil instead of the fertilizer, then fertilizer becomes a constant and type of soil becomes the independent variable.

The plant growth that we can observe here is called
(i) the result (of adding fertilizer)
(ii) the response (of adding fertilizer)
(iii) the effect (of adding fertilizer)
Completely Randomized Designs:-
Completely randomized designs are the simplest in which the treatments are assigned to the
experimental units completely at random. This allows every experimental unit, i.e., plot, animal,
soil sample, etc., to have an equal probability of receiving a treatment.
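A completely randomized assignment can be sketched as a shuffle of treatment labels over the experimental units. This is only an illustration: the function name, the plot labels and the four fertilizer names are hypothetical, and equal replication of each treatment is an added assumption that the definition above does not require.

```python
import random


def completely_randomized_design(units, treatments, seed=None):
    """Assign treatments to units completely at random.

    Assumption for this sketch: every treatment is replicated equally,
    so the number of units must be a multiple of the number of treatments.
    """
    if not treatments or len(units) % len(treatments) != 0:
        raise ValueError("number of units must be a multiple of the number of treatments")
    replications = len(units) // len(treatments)
    # One label per unit: each treatment repeated 'replications' times.
    labels = [t for t in treatments for _ in range(replications)]
    rng = random.Random(seed)  # a seed makes the layout reproducible
    rng.shuffle(labels)        # every ordering of labels is equally likely
    return dict(zip(units, labels))


# Hypothetical experiment: 12 plots and 4 fertilizers, 3 plots per fertilizer.
plots = [f"plot{i}" for i in range(1, 13)]
layout = completely_randomized_design(plots, ["A", "B", "C", "D"], seed=1)
for plot, fertilizer in layout.items():
    print(plot, fertilizer)
```

Because the labels are shuffled as a whole, every plot has the same chance of receiving any given fertilizer, which is the defining property of the design.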
REFERENCES
1. Statistical Method by S.C. Gupta
2. en.wikipedia.org/wiki/Statistics
3. www.mathsisfun.com/data/probability.html
4. www.stats.gla.ac.uk/steps/glossary/sampling.html
5. A First Course in Statistics with Applications by A.K.P.C. Swain
6. A Text Book of Agricultural Statistics by R. Rangaswamy
7. Fundamentals of Statistics, Vol. I and II by A.M. Goon, M.K. Gupta and B. Dasgupta
8. https://www.youtube.com/watch?feature=player_detailpage&v=UN206cSaF0k#t=7
9. Statistics Glossary v1.1 by Valerie J. Easton and John H. McColl
