© 2006 Hewlett-Packard Development Company, L.P.
The information contained herein is subject to change without notice
Oleg Janke
25-June-2012
Tip of the hat to Kevin Kacmarynski
Scatterplots and Cautions of Correlation
Topics for Today’s Discussion
• Scatterplots:
− Why bother?
− Creating & Analyzing
− Good usage
• Correlation and Association
• Potential Missteps
• Summary
• Real Personal Application
Presumption
• Basic graphing skills
• Familiar with charting (for example, in Excel)
• YES NO
Scatterplots: Why bother?
• Scatterplot (Scatter diagram)
− Converts two columns of numbers (ordered pairs) into a picture
− Explores relationship between two quantitative variables
• What value does it have?
− Determine possible cause and effect links (control)
− Predict results of a variable that is difficult to measure if it is strongly related to another variable that is easier to measure (proxy)
Creating a scatterplot
What is the relationship between height & weight?
Creating a scatterplot
Plot each characteristic of interest on a standard XY plot.
Katie James
Analyzing a scatterplot
Is there a relationship? OR Is it just randomness? (N=40)
How confident are you that there is a linear relationship between height and weight in this data set? (Choose one)
• 100%
• 99-100%
• 95-99%
• 90-95%
• 80-90%
• insufficient data to say
• no relationship
Analyzing a scatterplot
Is there a relationship or is it just randomness? (N=40)
• Add median lines and count quadrant totals+
[Figure: scatterplot divided by Median X and Median Y lines; quadrant counts 14, 6, 6, 14]
+ Olmstead-Tukey 1947
NO relationship
• shotgun effect
• approximately equal number in each quadrant
IS a relationship
• one diagonal will dominate
Analyzing a scatterplot
Is there a relationship or is it just randomness? (N=40)
• Add median lines and count quadrant totals
[Figure: scatterplot divided by Median X and Median Y lines; quadrant counts 14, 6, 6, 14]
• Less than 5% chance data could align this way simply from randomness
• Therefore fairly confident X & Y are related
SIGN TEST TABLE *
N 1% 5%
10 0 1
20 3 5
30 7 9
40 11 13
50 15 17
60 19 21
* Ishikawa “Guide to Quality Control”, 1976
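The median-line count and table lookup can be sketched in Python. This is a minimal sketch, not the presenter's procedure: the names `quadrant_counts` and `looks_related` are hypothetical, skipping points that sit exactly on a median line is an assumption, and so is the placement of the two 14s on the rising diagonal (the slide gives only the four totals). The limits are copied from the sign test table above, and a smaller-diagonal total at or below the limit is read as a relationship.

```python
from statistics import median

# Limits copied from the sign test table above (Ishikawa, 1976): N -> (1%, 5%)
SIGN_TEST_LIMITS = {10: (0, 1), 20: (3, 5), 30: (7, 9),
                    40: (11, 13), 50: (15, 17), 60: (19, 21)}


def quadrant_counts(xs, ys):
    """Count points in the four quadrants formed by the two median lines.

    Points lying exactly on a median line are skipped (an assumption).
    """
    mx, my = median(xs), median(ys)
    counts = {"upper_right": 0, "upper_left": 0,
              "lower_left": 0, "lower_right": 0}
    for x, y in zip(xs, ys):
        if x == mx or y == my:
            continue
        vertical = "upper" if y > my else "lower"
        horizontal = "right" if x > mx else "left"
        counts[f"{vertical}_{horizontal}"] += 1
    return counts


def looks_related(counts, n, level="5%"):
    """Compare the smaller diagonal total with the table limit for N = n."""
    rising = counts["upper_right"] + counts["lower_left"]
    falling = counts["upper_left"] + counts["lower_right"]
    limit = SIGN_TEST_LIMITS[n][0 if level == "1%" else 1]
    return min(rising, falling) <= limit


# Quadrant totals from the slide (N=40); which quadrant holds which total is assumed
slide_counts = {"upper_right": 14, "upper_left": 6,
                "lower_left": 14, "lower_right": 6}
print(looks_related(slide_counts, 40, "5%"))  # smaller diagonal is 12, limit is 13
print(looks_related(slide_counts, 40, "1%"))  # smaller diagonal is 12, limit is 11
```

With the slide's totals the smaller diagonal is 6 + 6 = 12, which is within the 5% limit of 13 but above the 1% limit of 11, matching the "less than 5% chance" reading.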
Good usage of Scatterplots
• In this plot, we observe a clear relationship between height and weight
• As the height of individuals increases, their weight tends to increase as well
• In the ideal case this relationship is captured by the Body Mass Index (BMI)
Good usage of scatterplots
Scenario 1: We are building parts on one line in one location.
What does the plot tell us about part length and part diameter?
Good usage of scatterplots
Scenario 2: We build the same part on two different Lines.
Now what does the plot tell us about part length & part diameter?
Good usage of scatterplots
• IS a Line effect here
• Relationship between Diam & Length differs by Line
• Diam1 twice Diam2
− Tighter process control?
• Length1 < Length2
Now what does the plot tell us about part length & part diameter?
• Always be alert to possible strata in the data
• Plotting your data is crucial for discovery
Correlation Coefficient
• Correlation is defined as a measure of the strength of the linear relationship between two quantitative variables
− Correlation coefficient is a mathematically calculated value
− Correlation values are always between -1 and +1
• 0 indicates no linear correlation (perfect shotgun pattern)
• -1 or +1 indicates perfect correlation (all points fall on a line)
• Sign indicates direction
− Positive: up and to right
− Negative: down and to right
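The slide does not show the calculation itself. As a sketch, assuming the coefficient meant here is the usual Pearson product-moment correlation, it can be computed as below; the function name `pearson_r` and the tiny data sets are illustrative only.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y divided by the
    product of their individual spreads, so the result lies in [-1, +1]."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# All points on a rising line, then all points on a falling line
print(pearson_r([1, 2, 3], [2, 4, 6]))
print(pearson_r([1, 2, 3], [6, 4, 2]))
```

The first call illustrates the +1 case (up and to the right) and the second the -1 case (down and to the right).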
Correlation and Association
• Re-visiting our first example, we saw a strong, positive relationship between height and weight
• Supported by correlation coefficient value of 0.709
• Relationship exists
• Does NOT prove causality
Correlation = 0.709
Calculation provided by JMP statistical software
Correlation and Association
Medical Trial
• Dosage 490-510 mg
• Recorded therapeutic response from 20 to 100
Is there a correlation between Dosage and Response?
• Yes
• No
• Insufficient data
• Don’t know
Correlation and Association
• Since calculated correlation value is zero, there is no association between dosage and desired response! Right?
− No LINEAR relationship
• Correlation coefficient, by itself, does not tell the entire story
• Always look at your graphs to see what the data say
Correlation coefficient = 0
Calculation provided by JMP statistical software
Questions?
How to create/analyze Scatterplots
Correlation and Associations
Missteps with scatterplots & correlation
1. Bimodal distributions
2. Stratified data
3. Lurking variables
4. Extrapolation
5. Too narrow range of X (independent variable)
6. Weak/sloppy measurement
7. Chicken and Egg Syndrome
Misstep #1 with scatterplots & correlation
• What is my house worth?
− Sale price data & house size were collected on 21 houses in the same town
− Another house (mine) in same town is 2300 square feet in size, so it should be worth a little over $200K
− Correlation coefficient = 0.943 (very high)
What is the problem in this analysis and the resulting conclusion?
Misstep #1 with scatterplots & correlation
What is the problem?
• Relationship/correlation dependent solely on one data point
• Why might this one point not be appropriate?
• Location (school district, suburban)
• Features (pool, lot, barn, view)
• Timing (peak of housing bubble)
• Example of Bimodal data
Need both appropriate data & proper analysis techniques
Is there a linear relationship?
• What do median lines say about the relationship?
[Figure: scatterplot with median lines; quadrant counts 4, 7, 6, 4]
Sign Test Table
N 1% 5%
10 0 1
20 3 5
30 7 9
40 11 13
50 15 17
Misstep #2 with scatterplots & correlation
• Based on this data set, with high Correlation Coefficient (0.780), what’s the relationship between shoe size and knowledge?
• What’s missing?
Correlation = 0.780
Calculation provided by JMP statistical software
Misstep #2 with scatterplots & correlation
• Does this help solve mystery?
• Be sure to look for hidden variables that might have an impact on relationship
• Stratified data – subpopulations with different relationships – can give erroneous conclusions
For example: CSAT data
• Those who respond to survey
• Those who do NOT respond to survey
Do both groups have similar opinions?
Misstep #3 with scatterplots & correlation
Beware of Lurking Variables
• Related through a common 3rd variable
− Ice cream sales correlate with water usage (temperature)
− Height–weight example (age)
− Call volume at HP Call Center A correlates with call volume at HP Call Center B (business)
• Related through independent growth (decay) rates
− Population in Indonesia correlates with price of tea in NYC (growth)
− My car’s value correlates with grams of Cobalt-60 isotope (decay)
• Both have half-lives of about 5-6 years
• Related through measuring same characteristic differently
− Weight in pounds is correlated with weight in kilos
− Attendance at an event is correlated with empty seats at same event
− Area of a US state is correlated with population of that state
• Some notable exceptions (AK, MT)
Does School Spending Educate Students?
States spending more per student have lower SAT* scores!
[Figure: "Expenditure/pupil by State vs SAT". X axis: expenditure per pupil for public school K-12 (2002-03), $4,000 to $12,000. Y axis: SAT scores for 1998, 900 to 1200. Annotation: obvious negative correlation.]
* The SAT is a standardized test used by many colleges across the US to determine level of student preparedness for college
Does School Spending Educate Students?
States spending more per student have lower SAT scores!
[Figure: "Expenditure/pupil by State vs SAT". X axis: expenditure per pupil for public school K-12 (2002-03). Y axis: SAT scores for 1998. Annotation: negative correlation.]
Do we have a measurement issue here?
What do SAT scores actually measure?
• Test performance – at a minimum
• Education – Not always correlated with Knowledge
• Knowledge – Our belief that this leads to Life Success
• Life Success – This is what we would like to be the case
Be careful of proxies that stand in for other measures
Does School Spending Educate Students?
States spending more per student have lower SAT scores!
[Figure: "Expenditure/pupil by State vs SAT". X axis: expenditure per pupil for public school K-12 (2002-03). Y axis: SAT scores for 1998. Annotation: negative correlation.]
Do we have a measurement issue here?
No! SAT scores are predictive of Life Success – financial
• Life Success – college grad, good job
• Not with certainty, but on average
• Should we move to states with lower student spending?
Does School Spending Educate Students?
Does Percent of students taking SAT impact SAT scores?
Correlation Coefficient = .92
Beware the Lurking Variable!
[Figure: "1998 SAT by State". X axis: % taking SAT, 0 to 90. Y axis: composite SAT, 900 to 1200. Points labeled by state abbreviation. Fitted curve: y = 1278x^(-0.0575), R² = 0.8461.]
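The fitted curve on this chart can be evaluated directly. This sketch assumes the trendline label reads as a power law, y = 1278 * x^(-0.0575), with x the percentage of students taking the SAT. The function name `fitted_sat` is hypothetical, and the participation rates tried below are arbitrary.

```python
from math import sqrt


def fitted_sat(pct_taking):
    """Composite SAT from the chart's trendline, read as a power law (assumption)."""
    return 1278 * pct_taking ** -0.0575


# Predicted composite score at low, medium, and high participation
for pct in (10, 40, 80):
    print(f"{pct}% taking SAT -> about {fitted_sat(pct):.0f}")

# The reported coefficient is the square root of the chart's R-squared
r_magnitude = sqrt(0.8461)
print(f"|r| = {r_magnitude:.2f}")
```

The square root of R² gives only the magnitude of the coefficient. The negative exponent shows the direction: scores fall as participation rises.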
Misstep #4 with scatterplots & correlation
Back to Height & Weight data
• How much would a 100-inch (~2.5 meters) person weigh?
[Figure: height-weight scatterplot with axes extended to heights of 80-100 and weights of 200-300]
• From the scatterplot he would weigh ~300# (136 kg)
• Can we make this prediction?
Misstep #4 with scatterplots & correlation
• “Predicted” 300 pounds!
• Robert Wadlow was 8 ft 11 in (2.72 m) and weighed 439# (199 kg)
• Interpolating within the range of the independent variable is acceptable
• Extrapolating beyond the range of the independent variable is dangerous
• Relationship may not be stable
Misstep #5 with scatterplots & correlation
Back to Height & Weight data
• What would the conclusion be if height ranged only from 1600.0 mm to 1700.0 mm (about 63-67 inches)?
• Easily conclude no relationship between height and weight
• Make sure the range of the independent variable (X) is sufficiently large relative to the dependent variable (Y)
Misstep #6 with scatterplots & correlation
Poor Measurement System
• Inappropriate tool or gage to measure
− Pixel width with standard yard/meter stick
− Monitor response time with second hand on watch
• Weak tech repeatability
− Tech visual determination of damage of NB sent in for repair
− Typographical errors on a written page
Ensure Measurement System Analysis performed before data are collected
Misstep #7 with scatterplots & correlation
Chicken and Egg Syndrome
• Which came first?
• What is the cause and what is the effect?
− Do children from poor families do poorly academically because they are poor OR are they poor because of poor academic performance?
− Do consumers buy good product out of loyalty OR are consumers loyal because of good product?
• Vicious/Virtuous Cycles – hard to break through
• Relationship is there; Causality is not easily determined
Misstep Summary
Watch out for ….
• Bimodal distribution → House Size and Price
• Stratified data → Shoe Size and Knowledge
• Lurking variable → SAT Scores and Participation Rate
− Underlying third variable
− Common but unrelated growth/decay curves
− Same variable measured differently
• Extrapolation → Height and Weight for tallest man
− Generalizing from a sampled subset to a broader, larger population
• Narrow range of X → Height and Weight
• Sloppy measurement
− Can hide a real relationship
− May create one when none exists
• Chicken and Egg Syndrome
− Variables are related, but which is cause & which is effect?
Questions?
Seven Missteps
Summary
• Scatterplots
− Simple, but powerful tool to explore relationships between two quantitative variables
• Be sure data are representative of question
− “What are we trying to accomplish?”
• Plot data to look for anomalies or associations
• Correlation has special meaning
− Correlation does not imply causation
− Nor does lack of correlation deny causation
• Recall Missteps that may impact scatterplot/correlation analysis, including lurking variables
Personal Example
Memory Loss Boosts Risk of Death
Cognitive Impairment   Qty    Median Lifespan (months)   Mortality
None                   3157   138                        57%
Mild                   533    106                        68%
Moderate/Severe        267    63                         79%
Were there any missteps in the analysis?
Key Points in Article
• About 4000 men & women
• Aged 60 to 102
• Indianapolis, Indiana, USA
• Started in early 1990’s; ended 2006
• Lower socio-economic background
• 10 questions to assess mental status
• Primary care Dr appt
• No intervening follow-up of mental assessment
Memory Loss Boosts Risk of Death
Potential missteps in analysis
1) Applies equally for men and women?
→ Stratified data?
2) Indy only? OR for all USA? OR World Wide?
3) Only applies to those that go to Doctor?
4) Socio-economic Background – What role does it play?
→ Three possible Extrapolations
5) How repeatable was 10 question assessment?
→ Measurement system
6) Why combine Moderate & Severe?
7) Depends on fewer points at extreme
→ Bimodal?
8) And the Big Misstep
→ Lurking variable!!!
Does Cognitive Impairment hasten death?
OR
Does Age Boost the Risk of Death?
Questions?
Scatterplots and Cautions of Correlation

  • 1. © 2006 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice Oleg Janke 25-June-2012 Tip of the hat to Kevin Kacmarynski Scatterplots and Cautions of Correlation
  • 2. Topics for Today’s Discussion • Scatterplots: − Why bother? − Creating & Analyzing − Good usage • Correlation and Association • Potential Missteps • Summary • Real Personal Application
  • 3. Presumption • Basic graphing skills • Familiar with charting (for example, as Excel) • YES NO
  • 4. Scatterplots: Why bother? • Scatterplot (Scatter diagram) − Converts two columns of numbers (ordered pairs) into picture − Explores relationship between two quantitative variables • What value does it have? − Determine possible cause and effect links (control) − Predict results of variable that is difficult to measure if it is strongly related to another variable that is easier to measure (proxy)
  • 5. Creating a scatterplot What is the relationship between height & weight?
  • 6. Creating a scatterplot Plot each characteristic of interest on a standard XY plot. Katie James
  • 7. Analyzing a scatterplot Is there a relationship? OR Is it just randomness? (N=40)
  • 8. Analyzing a scatterplot Is there a relationship? OR Is it just randomness? (N=40) How confident are you that there is a linear relationship between height and weight in this data set? (Choose one) • 100% • 99-100% • 95-99% • 90-95% • 80-90% • insufficient data to say • no relationship
  • 9. Analyzing a scatterplot Is there a relationship or is it just randomness? (N=40) • Add Median lines and count quadrant totals+ Median X 14 6 6 14 Median Y + Olmstead-Tukey 1947
  • 10. Analyzing a scatterplot Is there a relationship or is it just randomness? (N=40) • Add Median lines and count quadrant totals+ Median X 14 6 6 14 Median Y + Olmstead-Tukey 1947 NO relationship • shotgun effect • appx equal number in each quadrant IS a relationship • one diagonal will dominate
  • 11. Analyzing a scatterplot Is there a relationship or is it just randomness? (N=40) • Add Median lines and count quadrant totals Median X 14 6 6 14 Median Y •Less than 5% chance data could align this way simply from randomness • Therefore fairly confident X& Y are related SIGN TEST TABLE * N 1% 5% 10 0 1 20 3 5 30 7 9 40 11 13 50 15 17 60 19 21 * Ishikawa “Guide to Quality Control”, 1976
  • 12. Good usage of Scatterplots • In this plot, we observe a clear relationship between height and weight • As height of individuals increase, their weight tends to increase as well • In the ideal case this relationship is called Body Mass Index (BMI)
  • 13. Good usage of scatterplots Scenario 1: We are building parts on one line in one location. What does the plot tell us about part length and part diameter?
  • 14. Good usage of scatterplots Scenario 2: We build the same part on two different Lines. Now what does plot tell us about part length & part diameter?
  • 15. Good usage of scatterplots • IS a Line effect here • Relationship between Diam & Length differs by Line • Diam1 twice Daim2 − Tighter process control? • Length1 < Length2 Now what does plot tell us about part length & part diameter? • Always be alert to possible strata in the data • Plotting your data is crucial for discovery
  • 16. Correlation Coefficient • Correlation is defined as measure of strength of linear relationship between two quantitative variables − Correlation coefficient is a mathematically calculated value: − Correlation values are always between -1 and +1 • 0 indicates no correlation (perfect shotgun pattern) • -1 and +1 indicates perfect correlation (all points fall on line) • Sign indicates direction − Positive: up and to right − Negative: down and to left
  • 17. Correlation and Association • Re-visiting our first example, we saw strong, positive relationship between height and weight • Supported by correlation coefficient value of 0.709 • Relationship exists • Does NOT prove causality Correlation = 0.709 Calculation provided by JMP statistical software
  • 18. Correlation and Association Medical Trial • Dosage 490-510 mg • Recorded therapeutic response from 20 to 100 Is there a correlation between Dosage and Response? • Yes • No • Insufficient data • Don’t know
  • 19. Correlation and Association • Since calculated correlation value is zero, there is no association between dosage and desired response! Right? − No LINEAR relationship • Correlation coefficient, by itself, does not tell the entire story • Always look at your graphs to see what the data say Correlation coefficient = 0 Calculation provided by JMP statistical software 2 2 2 2
  • 20. Questions? How to create/analyze Scatterplots Correlation and Associations
  • 21. Missteps with scatterplots & correlation 1. Bimodal distributions 2. Stratified data 3. Lurking variables 4. Extrapolation 5. Too narrow range of X (independent variable) 6. Weak/sloppy measurement 7. Chicken and Egg Syndrome
  • 22. Misstep #1with scatterplots & correlation • What is my house worth? − Sale price data & house size were collected on 21 houses in the same town − Another house (mine) in same town is 2300 square feet in size, so it should be worth a little over $200K − Correlation coefficient = 0.943 (very high) What is the problemin this analysis and the resulting conclusion?
  • 23. Misstep #1with scatterplots & correlation What is the problem? • Relationship/correlation dependent solely on one data point • Why might this one point not be appropriate? • Location (school district, suburban) • Features (pool, lot, barn, view) • Timing (peak of housing bubble) • Example of Bimodal data Need both appropriate data & proper analysis techniques 4 7 6 4 Is there a linear relationship? • What do median lines say about the relationship? Sign Test Table N 1% 5% 10 0 1 20 3 5 30 7 9 40 11 13 50 15 17
  • 24. Misstep #2 with scatterplots & correlation • Based on this data set, with high Correlation Coefficient (0.780) what’s the relationship between shoe size and knowledge? • What’s missing? Correlation = 0.780 Calculation provided by JMP statistical software
  • 25. Misstep #2 with scatterplots & correlation • Does this help solve mystery? • Be sure to look for hidden variables that might have an impact on relationship • Stratified data: subpopulations with different relationships can give erroneous conclusions For example: CSAT data • Those who respond to survey • Those who do NOT respond to survey Do both groups have similar opinions?
  • 26. Misstep #3 with scatterplots & correlation Beware of Lurking Variables • Related through common 3rd variable − Ice cream sales correlate with water usage (temperature) − Height–weight example (age) − Call volume at HP Call Center A correlates with call volume at HP Call Center B (business) • Related through independent growth (decay) rates − Population in Indonesia correlates with price of tea in NYC (growth) − My car’s value correlates with grams of Cobalt-60 isotope (decay) • Both have half-lives of about 5-6 years • Related through measuring same characteristic differently − Weight in pounds is correlated with weight in kilos − Attendance at an event is correlated with empty seats at same event − Area of a US state is correlated with population of that state • Some notable exceptions (AK, MT)
  • 27. Does School Spending Educate Students? States spending more per student have lower SAT* scores! [Scatterplot: Expenditure/pupil by State vs SAT; x-axis: expenditure per pupil for public school K-12 (2002-03), $4,000 to $12,000; y-axis: SAT scores for 1998, 900 to 1200] Obvious Negative Correlation * SAT test is a standardized test used by many colleges across US to determine level of student preparedness for college
  • 28. Does School Spending Educate Students? States spending more per student have lower SAT scores! [Scatterplot: Expenditure/pupil by State vs SAT; x-axis: expenditure per pupil for public school K-12 (2002-03), $4,000 to $12,000; y-axis: SAT scores for 1998, 900 to 1200] Negative Correlation Do we have a measurement issue here? What do SAT scores actually measure? • Test performance – at a minimum • Education – Not always correlated with Knowledge • Knowledge – Our belief that this leads to Life Success • Life Success – This is what we would like to be the case Be careful of proxies that stand in for other measures
  • 29. Does School Spending Educate Students? States spending more per student have lower SAT scores! [Scatterplot: Expenditure/pupil by State vs SAT; x-axis: expenditure per pupil for public school K-12 (2002-03), $4,000 to $12,000; y-axis: SAT scores for 1998, 900 to 1200] Negative Correlation Do we have a measurement issue here? No! SAT scores are predictive of Life Success – financial • Life Success – college grad, good job • Not with certainty, but on average • Should we move to states with lower student spending?
  • 30. Does School Spending Educate Students? Does Percent of students taking SAT impact SAT scores? Correlation Coefficient = .92 Beware the Lurking Variable! [Scatterplot: 1998 SAT by State; x-axis: % taking SAT, 0 to 90; y-axis: composite SAT, 900 to 1200; points labeled by state abbreviation; fitted trendline $y = 1278\,x^{-0.0575}$, $R^2 = 0.8461$]
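The fitted trendline on this slide can be evaluated directly. The sketch below uses only the constants printed on the chart (1278, -0.0575, R² = 0.8461); the function name is illustrative, and reading the reported 0.92 as the square root of the trendline's R² is an assumption.

```python
import math

def predicted_sat(pct_taking):
    """Composite SAT predicted by the slide's power-law trendline."""
    return 1278 * pct_taking ** -0.0575

# Predicted state averages at low and high participation rates.
for pct in (5, 40, 80):
    print(f"{pct:>2}% taking SAT -> predicted composite {predicted_sat(pct):.0f}")

# Magnitude of the correlation implied by the trendline's R^2.
r_magnitude = math.sqrt(0.8461)
print(f"|r| = {r_magnitude:.2f}")
```

Because the exponent is negative, predicted scores fall as participation rises, so the underlying association is a negative one even though the slide quotes the coefficient's magnitude.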
  • 31. Misstep #4 with scatterplots & correlation Back to Height & Weight data • How much would a 100-inch (~2.5 meters) person weigh? • From scatterplot he would weigh ~300# (136 kg) • Can we make this prediction? [Scatterplot detail: height axis 80 to 100 inches; weight axis 200 to 300 pounds]
  • 32. Misstep #4 with scatterplots & correlation • “Predicted” 300 pounds! • Robert Wadlow was 8 ft 11 in (2.72 m) and weighed 439# (199 kg) • Interpolating within range of independent variable is acceptable • Extrapolating beyond range of independent variable is dangerous • Relationship may not be stable
  • 33. Misstep #5 with scatterplots & correlation Back to Height & Weight data • What would conclusion be if height ranged only from 1600.0 mm to 1700.0 mm (about 63-67 inches)? • Easily conclude no relationship between height and weight • Make sure range of independent variable (X) is sufficiently large relative to dependent variable (Y)
  • 34. Misstep #6 with scatterplots & correlation Poor Measurement System • Inappropriate tool or gauge to measure − Pixel width with standard yard/meter stick − Monitor response time with second hand on watch • Weak tech repeatability − Tech visual determination of damage of NB sent in for repair − Typographical errors on a written page Ensure Measurement System Analysis performed before data are collected
  • 35. Misstep #7 with scatterplots & correlation Chicken and Egg Syndrome • Which came first? • What is the cause and what is the effect? − Do children from poor families do poorly academically because they are poor OR are they poor because of poor academic performance? − Do consumers buy good product out of loyalty OR are consumers loyal because of good product? • Vicious/Virtuous Cycles – hard to break through • Relationship is there; Causality is not easily determined
  • 36. Misstep Summary Watch out for … • Bimodal distribution → House Size and Price • Stratified data → Shoe Size and Knowledge • Lurking variable → SAT Scores and Participation Rate − Underlying third variable − Common but unrelated growth/decay curves − Same variable measured differently • Extrapolation → Height and Weight for tallest man − Generalizing from a sampled subset to a broader, larger population • Narrow range of X → Height and Weight • Sloppy measurement − Can hide a real relationship − May create one when none exists • Chicken and Egg Syndrome − Variables are related, but which is cause & which is effect?
  • 38. Summary • Scatterplots − Simple, but powerful tool to explore relationships between two quantitative variables • Be sure data are representative of question − “What are we trying to accomplish?” • Plot data to look for anomalies or associations • Correlation has special meaning − Correlation does not imply causation − Nor does lack of correlation deny causation • Recall Missteps that may impact scatterplot/correlation analysis, including lurking variables
  • 40. Memory Loss Boosts Risk of Death
  • 41. Memory Loss Boosts Risk of Death Cognitive impairment table (Qty, median lifespan in months, mortality) • None: 3157, 138, 57% • Mild: 533, 106, 68% • Moderate/Severe: 267, 63, 79% Were there any missteps in the analysis? Key Points in Article • About 4000 men & women • Aged 60 to 102 • Indianapolis, Indiana, USA • Started in early 1990s; ended 2006 • Lower socio-economic background • 10 questions to assess mental status • Primary care Dr appt • No intervening follow-up of mental assessment
  • 42. Memory Loss Boosts Risk of Death Potential missteps in analysis 1) Applies equally for men and women? → Stratified data? 2) Indy only? OR for all USA? OR World Wide? 3) Only applies to those that go to Doctor? 4) Socio-economic Background: What role does it play? → Three possible Extrapolations 5) How repeatable was 10 question assessment? → Measurement system 6) Why combine Moderate & Severe? 7) Depends on fewer points at extreme → Bimodal? 8) And the Big Misstep → Lurking variable!!! Key Points in Article • About 4000 men & women • Aged 60 to 102 • Indianapolis, Indiana, USA • Started in early 1990s; ended 2006 • Lower socio-economic background • 10 questions to assess mental status • Primary care Dr appt • No intervening follow-up of mental assessment Cognitive impairment table (Qty, median lifespan in months, mortality) • None: 3157, 138, 57% • Mild: 533, 106, 68% • Moderate/Severe: 267, 63, 79%
  • 43. Memory Loss Boosts Risk of Death Potential missteps in analysis 1) Applies equally for men and women? → Stratified data? 2) Indy only? OR for all USA? OR WW? 3) Only applies to those that go to Doctor? 4) Socio-economic Background: What role does it play? → Three possible Extrapolations 5) How repeatable was 10 question assessment? → Measurement system 6) Why combine Moderate & Severe? 7) Depends on fewer points at extreme → Bimodal? 8) And the Biggie … → A Lurking variable!!! Key Points in Article • About 4000 men & women • Aged 60 to 102 • Indianapolis, Indiana, USA • Started in early 1990s; ended 2006 • Lower socio-economic background • 10 questions to assess mental status • Primary care Dr appt • No intervening follow-up of mental assessment Cognitive impairment table (Qty, median lifespan in months, mortality) • None: 3157, 138, 57% • Mild: 533, 106, 68% • Moderate/Severe: 267, 63, 79% Does Cognitive Impairment hasten death? OR Does Age Boost the Risk of Death?

Editor's Notes

  1. Cover the basics quickly, then get into the Cautions, and end up with a real-life application.
  2. Cause and Effect Link is most powerful, but also fraught with pitfalls Examples: Automotive air bag inflators or matches
  3. 40 pairs of data. Tough to understand the relationship between 2 variables from two columns of numbers; gets tougher with more data … 100’s or 1000’s of records
  4. A picture is worth a thousand words. Can be easily done by hand for small datasets; OR for larger sets many scatter plot apps can be used, including Excel and Minitab. Typically the independent/cause is on the X axis & the dependent/effect is on the Y axis
  5. PERSONAL EXAMPLE We had 2 different products going down same line and tracked yield on daily basis. Production supervisor was convinced and showed specific daily results where the two products tracked – they had high yield and had low yield on the same dates. But when the full picture was created with ALL the data the quadrant totals were 14,13,13,13. Mtg was over in 5 minutes. Sometimes we see/hear only what we want to see/hear.
  6. We have an association between dosage and response
  7. Really only have 2 data points here, in spite of the N=21. One pt is well substantiated (n=20), but the other point is really light. It may or may not be like the other 20 houses, and hence may not be comparable. FEATURES: nice pool or large lot or barn or great view. LOCATION: in bad school district or next to landfill or toxic site. TIMING: did this one sell at the height of the housing bubble? Good example where old-fashioned median line analysis is more correct than a high-powered mathematical calculation.
  8. This does not make sense, that shoe size determines knowledge – otherwise all the nerds would be basketball players.
  9. When you have multiple correlations and regressions, and more complex variables, the lurking variables (hidden variables) may not be so readily apparent.
  10. Data presented here makes strong case that money does NOT buy education/knowledge/test performance
  13. Because participation rate varies by state, we cannot simply compare state spending per student. Need to account/allow for participation rate
  14. All of plot could be turned 90 degrees, and the relationship would still be there.
  15. This hit really close to home for me. I am older than most people at HP. I mix my kids’ names up all the time. I forget where I left my keys on a regular basis.
  16. Can’t say for sure Only that these factors were NOT addressed in the article. (Could be oversight by the author)
  17. Questions arose for me b/c this was not clear in the summarized article found on WebMD. Not sure if this was covered in the original research or not. Normally women live longer than men, but there was NO mention of the effect of gender, though both were included (possible strata, like the shoe size example). The authors extrapolated from Indianapolis to US and possibly WW. May or may not be reasonable (extrapolation). Only those folks that went to see a Primary Care Dr. What about those that did not? (extrapolation) Likewise authors extended conclusions from lower socio-economic background to whole population (extrapolation). 10 questions to assess mental status seems a bit thin. Wish it were so easy. (GR&R) Were there not enough “Severe”? Were the results for “Severe” contrary to the author’s position? (weird) Although started with 4000, most of these are the baseline (NONE), with far fewer (7%) at the extreme (almost bimodal distribution). Lurking variable of Agedness … Hope this research was not done with Government grant money