SlideShare a Scribd company logo
1 of 31
Download to read offline
Data Analysis and Statistical Methods
Statistics 651
http://www.stat.tamu.edu/~suhasini/teaching.html
Lecture 10 (MWF) Checking for normality of the data using the QQplot
Suhasini Subba Rao
Lecture 10 (MWF) QQplots
Checking for Normality (with the empirical rule)
• Suppose x1, . . . , xn is a sample from a normal distribution with mean µ
and variance σ2
.
• First we order them from the smallest number to the largest number:
x(1), . . . , x(n).
• Estimate the mean and standard deviations from the data; x̄ and s.
• Plot all the observations on a number line. Locate the mean x̄ on this line
and also the intervals: [x̄−s, x̄+s], [x̄−2s, x̄+2s] and [x̄−3s, x̄+3s].
• If the observations came from a normal, then
– Roughly 68% of the observations should lie in the interval [x̄−s, x̄+s].
1
Lecture 10 (MWF) QQplots
– 95% of the observations should lie in the interval [x̄ − 2s, x̄ + 2s].
– 99.7% of the observations should lie in the interval [x̄ − 3s, x̄ + 3s].
• Remember this means counting the number of points in each interval,
and dividing it by the total number of observations.
2
Lecture 10 (MWF) QQplots
Example: Minimum temperatures
The mean of this distribution
is -10C and the standard
deviation is 10C. Calculate
the proportion within one
standard deviation, two
standard deviations etc.
The mean is µ = −13.8 and
σ = 9.4.
3
Lecture 10 (MWF) QQplots
• This is an extremely rough way to check for normality.
• There can exist weird non-normal distributions where the following:
– Roughly 68% of the observations should lie in the interval [x̄−s, x̄+s].
– 95% of the observations should lie in the interval [x̄ − 2s, x̄ + 2s].
– 99.7% of the observations should lie in the interval [x̄ − 3s, x̄ + 3s].
could be true.
4
Lecture 10 (MWF) QQplots
Motivating the QQplot
• A QQplots orders the data from the smallest to the largest and plots the
data against corresponding normal quantile. This allows on to check for
normality of the data. Precisely:
– Data X1, . . . , Xn ordered from smallest to largest X(1), . . . , X(n).
– Plot X(i) against the i/n quantile of the normal distribution (omitting
the first and last observations).
• If the data comes from a normal distribution (with the mean and variance
estimated from the data) the data (empirical quantiles) will match the
normal quantiles, and plot should lie on a straightline (on the x = y
line).
• A QQplot has nothing to do with linear regression. The line you see in
the plot is not the line of best fit.
5
Lecture 10 (MWF) QQplots
Checking for normality: The QQ plot
• This plots what has been described above.
• The QQplot consists of points and a straight 45 degree line.
X
X
X
X
X
(1)
(2)
(3)
(4)
y y y y y
(1) (2) (3) (4)
(5)
(5)
..
. .
.
x=y line
• If the points tend to lie on the straightline, then this suggests the
observations come from a normal distribution.
6
Lecture 10 (MWF) QQplots
Making a QQplot in JMP
• Always use software to make a QQplot.
• Analyze > Distribution. A window will pop-up. Highlight the variable,
press Y, Columns and press okay.
• Once the histogram pops up, click on the red triangle. Click on the
Normal Quantile Plot.
7
Lecture 10 (MWF) QQplots
Example: Antarctic maximum temperatures
This is the histogram and QQplot of the maximum temperatures in
Anarctica.
It is important to compare the black
points with the red straight line (not
the red dashed line).
• It would appear that the maximum temperatures are close to normal.
The mean of this data set is µ = 4.5 and the standard deviation is
σ = 2.16.
8
Lecture 10 (MWF) QQplots
• This means we can calculate proportions using the normal distribution.
• Question This month the maximum temperature is 7 degrees, what is
its percentile? Answer We assume normality: P(X ≤ 7) = P(Z ≤
(7 − 4.5)/2.16) = 0.87. Assuming normality, 7C degreees is in the 87%
percentile.
The percentile can be checked, by
using the actual data to calculate the
percentile. Based on the data the
proportion of temperatures less than
7 degrees is about 86.5%. Since 87%
and 86.5% are very close we see that
the normal distribution approximates
well the distribution of maximum
temperatures.
9
Lecture 10 (MWF) QQplots
Example: Antarctic minimum temperature QQplot
This is the histogram and QQplot of the maximum temperatures in
Anarctica.
• The distribution of minimum temperatures is far from normal. The
mean and standard deviation of this data set is −13.8 degrees and 9.3
respectively.
10
Lecture 10 (MWF) QQplots
• If we use normality of the data to calculate precentile corresponding to
−10C P(X ≤ −10) = P(Z = −10+13.8
9.4 ) = 0.654 (about 65.4%).
But, based on the data the
proportion of temperatures
less than −10 degrees is
about 55%, which is quite
different to the proportion
calculated using the normal
distribution. Approximating
the distribution with a normal
distribution is giving incorrect
percentiles/probabilities.
11
Lecture 10 (MWF) QQplots
Interpretating a QQ-plot
• Some experienced statisticans have shaman like powers when it comes
to interpretating QQ-plots.
• You don’t need them, but it is good to have a feel of them.
• There are three main features you need to look for;
– Left Skew. This means the distribution is not symmetric. Find the
mode (the heightest point of the distribution). The right of the mode
should be shorter than the left of the mode.
– Right Skew. This means the distribution is not symmetric. Find the
mode (the heightest point of the distribution). The right of the mode
should be longer than the left of the mode.
– Heavy tails. This means that the probability of large numbers if
much more likely than a normal distribution. For example for a
12
Lecture 10 (MWF) QQplots
normal distribution most the observations 98% lie within the interval
[x̄ − 3s, x̄ + 3s]. For a heavy tail distribution a far smaller proportion
lie in this interval.
13
Lecture 10 (MWF) QQplots
Skewed distributions
• A right skewed distribution (red) has a long right tail (green is normal).
• For a left skewed distribution the QQplot is the mirror image along the
45 degree line (arch going upwards and towards the left).
14
Lecture 10 (MWF) QQplots
A right skewed distribution and the QQplot
• This is right skewed. The QQplots has a “U”.
15
Lecture 10 (MWF) QQplots
QQplot of a left skewed distribution
• The above is indicates a left skewed distribution.
• The points are arched, going from the below the 45 degree line across it
and down again.
16
Lecture 10 (MWF) QQplots
Heavy tail distribution
• Has much thicker tails than a normal distribution (the blue are the tails
of a normal and red are the tails of a thick tail).
17
Lecture 10 (MWF) QQplots
QQplot of a heavy tailed distribution
• The plot is like an ‘S0
. On the left of the plot it is left of the 45 degree
line and then towards the right it goes to being right of the 45 degree
line.
18
Lecture 10 (MWF) QQplots
What does thick tailed distribution mean??
Look at the histogram of the following data set (size 200 observations).
Look at the proportion of points outside one/two and three standard
deviations of the mean (compare with 68%, 95% and 99.8%). It is a lot
more than the normal distribution. Look at the tails, it is higher (thicker)
than the normal distribution.
19
Lecture 10 (MWF) QQplots
The corresponding QQplot
Below we make a QQplot of the above data set.
The ‘S’ shape suggests the distribution has thick tails.
20
Lecture 10 (MWF) QQplots
QQplots: Some general warnings
• When there are a just a few observations. It is extremely difficult to
check for normality using a QQplot. The data below is generated using
a normal distribution, but there are only seven observations. Making it
very difficult to check for normality of the data.
21
Lecture 10 (MWF) QQplots
QQplot of the height data
The heights are not normally distributed. The horizontal lines that we
see is because the data is integer valued (heights are given to the nearest
inch). There is a mild “U” shape which suggest some element of skewness.
22
Lecture 10 (MWF) QQplots
QQplot of the average of 5 heights
This is when we take several samples each of size 5 and for each sample
evaluate the sample mean. Each point of the plot corresponds to one
sample mean. The sample mean not look exactly normal, but the points
are closer to the x = y then the QQplot of the raw heights on the previous
page.
23
Lecture 10 (MWF) QQplots
QQplot of binary data
Let us return to the example of people liking apple juice. 100 people
were interviewed and each person was asked whether they like apple juice
or not (1=yes, 0 = no). Here is the data
http://www.stat.tamu.edu/~suhasini/teaching651/apple_juice.txt.
24
Lecture 10 (MWF) QQplots
• 34% of this sample liked apple juice. This data is binary (not normal!),
this is why you see the two lines.
• It is clearly not normal, and you cannot make it more normal by
increasing the sample size.
• What does become normal is the sample proportion (which in this case
is 34%) - this is due to the CLT, which we discuss in lecture 12. But
only when the sample size is relatively large.
25
Lecture 10 (MWF) QQplots
Simulating data in JMP
Make a new data table. Go to Table > Cols > New Columns.. >
(In Column Properties select Formula) > Select Random in the new pop
window and the distribution you want to simulate from. In the window
above I chose Normal with mean 64.5 and standard deviation 2.5. This
means that the number will draw numbers close to 64.5 with spread 2.5.
26
Lecture 10 (MWF) QQplots
Transforming Data
• If the data is far from normal we often do a transformation of it to make
it have less outliers and less skewed.
Standard transforms for positive data {Xi}
• The log transform;
Yi = log Xi.
The variance of the transformed observation tends to be less than the
variance of the original observation (sometimes this transformation is
called ‘variance stablisation’). Often used when the sample mean and
sample variance of X are close to each other.
27
Lecture 10 (MWF) QQplots
• The power transform;
Yi = Xβ
i β 6= 0.
This transformation tends to control outliers and ‘unskews’ the data.
• The Box-Cox transform Xi;
Yi =
Xλ
i − 1
λ
λ 6= 0.
28
Lecture 10 (MWF) QQplots
Power transformation: Illustration
• Left is a QQplot of the original data and the right is the QQplot of the
square root of the data (i.e. Xi →
√
Xi = X
1/2
i ).
• Observe how the square root of the data is still skewed - but it is less
skewed than the original data. Reducing skewness in data is very useful
way of making the CLT ‘work’ for smaller sample sizes (see later).
29
Lecture 10 (MWF) QQplots
Transforming Baseball salaries
The baseball salary data found here (second to last column) is extremely
right skewed.
https://www.stat.tamu.edu/~suhasini/teaching651/MLBSalaries.
csv
By taking the log transform we reduce the skew, but not completely.
The left hand plot is the histogram and QQplot of the original data. The
right hand plot is the histogram and QQplot of the log-transformed data.
30

More Related Content

Similar to QQ-Plots.............................pdf

Chapter 3 Ken Black 2.ppt
Chapter 3 Ken Black 2.pptChapter 3 Ken Black 2.ppt
Chapter 3 Ken Black 2.pptNurinaSWGotami
 
Normal Distribution
Normal DistributionNormal Distribution
Normal DistributionCIToolkit
 
Sqqs1013 ch6-a122
Sqqs1013 ch6-a122Sqqs1013 ch6-a122
Sqqs1013 ch6-a122kim rae KI
 
Lect 3 background mathematics
Lect 3 background mathematicsLect 3 background mathematics
Lect 3 background mathematicshktripathy
 
3.3 Measures of relative standing and boxplots
3.3 Measures of relative standing and boxplots3.3 Measures of relative standing and boxplots
3.3 Measures of relative standing and boxplotsLong Beach City College
 
Lect 3 background mathematics for Data Mining
Lect 3 background mathematics for Data MiningLect 3 background mathematics for Data Mining
Lect 3 background mathematics for Data Mininghktripathy
 
Normal distribution
Normal distributionNormal distribution
Normal distributionTeratai Layu
 
Chapter 5 part1- The Sampling Distribution of a Sample Mean
Chapter 5 part1- The Sampling Distribution of a Sample MeanChapter 5 part1- The Sampling Distribution of a Sample Mean
Chapter 5 part1- The Sampling Distribution of a Sample Meannszakir
 
Measures of Dispersion.pptx
Measures of Dispersion.pptxMeasures of Dispersion.pptx
Measures of Dispersion.pptxVanmala Buchke
 

Similar to QQ-Plots.............................pdf (20)

Chapter 3 Ken Black 2.ppt
Chapter 3 Ken Black 2.pptChapter 3 Ken Black 2.ppt
Chapter 3 Ken Black 2.ppt
 
Normal Distribution
Normal DistributionNormal Distribution
Normal Distribution
 
Test of significance
Test of significanceTest of significance
Test of significance
 
Normal Distribution
Normal DistributionNormal Distribution
Normal Distribution
 
Sqqs1013 ch6-a122
Sqqs1013 ch6-a122Sqqs1013 ch6-a122
Sqqs1013 ch6-a122
 
Inorganic CHEMISTRY
Inorganic CHEMISTRYInorganic CHEMISTRY
Inorganic CHEMISTRY
 
Lect 3 background mathematics
Lect 3 background mathematicsLect 3 background mathematics
Lect 3 background mathematics
 
Statistics 3, 4
Statistics 3, 4Statistics 3, 4
Statistics 3, 4
 
Chapter04
Chapter04Chapter04
Chapter04
 
Chapter04
Chapter04Chapter04
Chapter04
 
3.3 Measures of relative standing and boxplots
3.3 Measures of relative standing and boxplots3.3 Measures of relative standing and boxplots
3.3 Measures of relative standing and boxplots
 
Lect 3 background mathematics for Data Mining
Lect 3 background mathematics for Data MiningLect 3 background mathematics for Data Mining
Lect 3 background mathematics for Data Mining
 
Chapter08
Chapter08Chapter08
Chapter08
 
Normal distribution
Normal distributionNormal distribution
Normal distribution
 
Chapter 5 part1- The Sampling Distribution of a Sample Mean
Chapter 5 part1- The Sampling Distribution of a Sample MeanChapter 5 part1- The Sampling Distribution of a Sample Mean
Chapter 5 part1- The Sampling Distribution of a Sample Mean
 
Measures of Dispersion.pptx
Measures of Dispersion.pptxMeasures of Dispersion.pptx
Measures of Dispersion.pptx
 
estimation
estimationestimation
estimation
 
Estimation
EstimationEstimation
Estimation
 
Stats chapter 9
Stats chapter 9Stats chapter 9
Stats chapter 9
 
Chapter3
Chapter3Chapter3
Chapter3
 

Recently uploaded

Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...Bertram Ludäscher
 
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Klinik Aborsi
 
jll-asia-pacific-capital-tracker-1q24.pdf
jll-asia-pacific-capital-tracker-1q24.pdfjll-asia-pacific-capital-tracker-1q24.pdf
jll-asia-pacific-capital-tracker-1q24.pdfjaytendertech
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRajesh Mondal
 
DAA Assignment Solution.pdf is the best1
DAA Assignment Solution.pdf is the best1DAA Assignment Solution.pdf is the best1
DAA Assignment Solution.pdf is the best1sinhaabhiyanshu
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabiaahmedjiabur940
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Klinik kandungan
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATIONLakpaYanziSherpa
 
sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444saurabvyas476
 
Displacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second DerivativesDisplacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second Derivatives23050636
 
Predictive Precipitation: Advanced Rain Forecasting Techniques
Predictive Precipitation: Advanced Rain Forecasting TechniquesPredictive Precipitation: Advanced Rain Forecasting Techniques
Predictive Precipitation: Advanced Rain Forecasting TechniquesBoston Institute of Analytics
 
Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024patrickdtherriault
 
Seven tools of quality control.slideshare
Seven tools of quality control.slideshareSeven tools of quality control.slideshare
Seven tools of quality control.slideshareraiaryan448
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxParas Gupta
 
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证acoha1
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareGraham Ware
 

Recently uploaded (20)

Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
 
jll-asia-pacific-capital-tracker-1q24.pdf
jll-asia-pacific-capital-tracker-1q24.pdfjll-asia-pacific-capital-tracker-1q24.pdf
jll-asia-pacific-capital-tracker-1q24.pdf
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
DAA Assignment Solution.pdf is the best1
DAA Assignment Solution.pdf is the best1DAA Assignment Solution.pdf is the best1
DAA Assignment Solution.pdf is the best1
 
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get CytotecAbortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
 
Abortion pills in Jeddah |+966572737505 | get cytotec
Abortion pills in Jeddah |+966572737505 | get cytotecAbortion pills in Jeddah |+966572737505 | get cytotec
Abortion pills in Jeddah |+966572737505 | get cytotec
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444
 
Displacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second DerivativesDisplacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second Derivatives
 
Predictive Precipitation: Advanced Rain Forecasting Techniques
Predictive Precipitation: Advanced Rain Forecasting TechniquesPredictive Precipitation: Advanced Rain Forecasting Techniques
Predictive Precipitation: Advanced Rain Forecasting Techniques
 
Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024
 
Seven tools of quality control.slideshare
Seven tools of quality control.slideshareSeven tools of quality control.slideshare
Seven tools of quality control.slideshare
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 

QQ-Plots.............................pdf

  • 1. Data Analysis and Statistical Methods Statistics 651 http://www.stat.tamu.edu/~suhasini/teaching.html Lecture 10 (MWF) Checking for normality of the data using the QQplot Suhasini Subba Rao
  • 2. Lecture 10 (MWF) QQplots Checking for Normality (with the empirical rule) • Suppose x1, . . . , xn is a sample from a normal distribution with mean µ and variance σ2 . • First we order them from the smallest number to the largest number: x(1), . . . , x(n). • Estimate the mean and standard deviations from the data; x̄ and s. • Plot all the observations on a number line. Locate the mean x̄ on this line and also the intervals: [x̄−s, x̄+s], [x̄−2s, x̄+2s] and [x̄−3s, x̄+3s]. • If the observations came from a normal, then – Roughly 68% of the observations should lie in the interval [x̄−s, x̄+s]. 1
  • 3. Lecture 10 (MWF) QQplots – 95% of the observations should lie in the interval [x̄ − 2s, x̄ + 2s]. – 99.7% of the observations should lie in the interval [x̄ − 3s, x̄ + 3s]. • Remember this means counting the number of points in each interval, and dividing it by the total number of observations. 2
  • 4. Lecture 10 (MWF) QQplots Example: Minimum temperatures The mean of this distribution is -10C and the standard deviation is 10C. Calculate the proportion within one standard deviation, two standard deviations etc. The mean is µ = −13.8 and σ = 9.4. 3
  • 5. Lecture 10 (MWF) QQplots • This is an extremely rough way to check for normality. • There can exist weird non-normal distributions where the following: – Roughly 68% of the observations should lie in the interval [x̄−s, x̄+s]. – 95% of the observations should lie in the interval [x̄ − 2s, x̄ + 2s]. – 99.7% of the observations should lie in the interval [x̄ − 3s, x̄ + 3s]. could be true. 4
  • 6. Lecture 10 (MWF) QQplots Motivating the QQplot • A QQplots orders the data from the smallest to the largest and plots the data against corresponding normal quantile. This allows on to check for normality of the data. Precisely: – Data X1, . . . , Xn ordered from smallest to largest X(1), . . . , X(n). – Plot X(i) against the i/n quantile of the normal distribution (omitting the first and last observations). • If the data comes from a normal distribution (with the mean and variance estimated from the data) the data (empirical quantiles) will match the normal quantiles, and plot should lie on a straightline (on the x = y line). • A QQplot has nothing to do with linear regression. The line you see in the plot is not the line of best fit. 5
  • 7. Lecture 10 (MWF) QQplots Checking for normality: The QQ plot • This plots what has been described above. • The QQplot consists of points and a straight 45 degree line. X X X X X (1) (2) (3) (4) y y y y y (1) (2) (3) (4) (5) (5) .. . . . x=y line • If the points tend to lie on the straightline, then this suggests the observations come from a normal distribution. 6
  • 8. Lecture 10 (MWF) QQplots Making a QQplot in JMP • Always use software to make a QQplot. • Analyze > Distribution. A window will pop-up. Highlight the variable, press Y, Columns and press okay. • Once the histogram pops up, click on the red triangle. Click on the Normal Quantile Plot. 7
  • 9. Lecture 10 (MWF) QQplots Example: Antarctic maximum temperatures This is the histogram and QQplot of the maximum temperatures in Anarctica. It is important to compare the black points with the red straight line (not the red dashed line). • It would appear that the maximum temperatures are close to normal. The mean of this data set is µ = 4.5 and the standard deviation is σ = 2.16. 8
  • 10. Lecture 10 (MWF) QQplots • This means we can calculate proportions using the normal distribution. • Question This month the maximum temperature is 7 degrees, what is its percentile? Answer We assume normality: P(X ≤ 7) = P(Z ≤ (7 − 4.5)/2.16) = 0.87. Assuming normality, 7C degreees is in the 87% percentile. The percentile can be checked, by using the actual data to calculate the percentile. Based on the data the proportion of temperatures less than 7 degrees is about 86.5%. Since 87% and 86.5% are very close we see that the normal distribution approximates well the distribution of maximum temperatures. 9
  • 11. Lecture 10 (MWF) QQplots Example: Antarctic minimum temperature QQplot This is the histogram and QQplot of the maximum temperatures in Anarctica. • The distribution of minimum temperatures is far from normal. The mean and standard deviation of this data set is −13.8 degrees and 9.3 respectively. 10
  • 12. Lecture 10 (MWF) QQplots • If we use normality of the data to calculate precentile corresponding to −10C P(X ≤ −10) = P(Z = −10+13.8 9.4 ) = 0.654 (about 65.4%). But, based on the data the proportion of temperatures less than −10 degrees is about 55%, which is quite different to the proportion calculated using the normal distribution. Approximating the distribution with a normal distribution is giving incorrect percentiles/probabilities. 11
  • 13. Lecture 10 (MWF) QQplots Interpretating a QQ-plot • Some experienced statisticans have shaman like powers when it comes to interpretating QQ-plots. • You don’t need them, but it is good to have a feel of them. • There are three main features you need to look for; – Left Skew. This means the distribution is not symmetric. Find the mode (the heightest point of the distribution). The right of the mode should be shorter than the left of the mode. – Right Skew. This means the distribution is not symmetric. Find the mode (the heightest point of the distribution). The right of the mode should be longer than the left of the mode. – Heavy tails. This means that the probability of large numbers if much more likely than a normal distribution. For example for a 12
  • 14. Lecture 10 (MWF) QQplots normal distribution most the observations 98% lie within the interval [x̄ − 3s, x̄ + 3s]. For a heavy tail distribution a far smaller proportion lie in this interval. 13
  • 15. Lecture 10 (MWF) QQplots Skewed distributions • A right skewed distribution (red) has a long right tail (green is normal). • For a left skewed distribution the QQplot is the mirror image along the 45 degree line (arch going upwards and towards the left). 14
  • 16. Lecture 10 (MWF) QQplots A right skewed distribution and the QQplot • This is right skewed. The QQplots has a “U”. 15
  • 17. Lecture 10 (MWF) QQplots QQplot of a left skewed distribution • The above is indicates a left skewed distribution. • The points are arched, going from the below the 45 degree line across it and down again. 16
  • 18. Lecture 10 (MWF) QQplots Heavy tail distribution • Has much thicker tails than a normal distribution (the blue are the tails of a normal and red are the tails of a thick tail). 17
  • 19. Lecture 10 (MWF) QQplots QQplot of a heavy tailed distribution • The plot is like an ‘S0 . On the left of the plot it is left of the 45 degree line and then towards the right it goes to being right of the 45 degree line. 18
  • 20. Lecture 10 (MWF) QQplots What does thick tailed distribution mean?? Look at the histogram of the following data set (size 200 observations). Look at the proportion of points outside one/two and three standard deviations of the mean (compare with 68%, 95% and 99.8%). It is a lot more than the normal distribution. Look at the tails, it is higher (thicker) than the normal distribution. 19
  • 21. Lecture 10 (MWF) QQplots The corresponding QQplot Below we make a QQplot of the above data set. The ‘S’ shape suggests the distribution has thick tails. 20
  • 22. Lecture 10 (MWF) QQplots QQplots: Some general warnings • When there are a just a few observations. It is extremely difficult to check for normality using a QQplot. The data below is generated using a normal distribution, but there are only seven observations. Making it very difficult to check for normality of the data. 21
  • 23. Lecture 10 (MWF) QQplots QQplot of the height data The heights are not normally distributed. The horizontal lines that we see is because the data is integer valued (heights are given to the nearest inch). There is a mild “U” shape which suggest some element of skewness. 22
  • 24. Lecture 10 (MWF) QQplots QQplot of the average of 5 heights This is when we take several samples each of size 5 and for each sample evaluate the sample mean. Each point of the plot corresponds to one sample mean. The sample mean not look exactly normal, but the points are closer to the x = y then the QQplot of the raw heights on the previous page. 23
  • 25. Lecture 10 (MWF) QQplots QQplot of binary data Let us return to the example of people liking apple juice. 100 people were interviewed and each person was asked whether they like apple juice or not (1=yes, 0 = no). Here is the data http://www.stat.tamu.edu/~suhasini/teaching651/apple_juice.txt. 24
  • 26. Lecture 10 (MWF) QQplots • 34% of this sample liked apple juice. This data is binary (not normal!), this is why you see the two lines. • It is clearly not normal, and you cannot make it more normal by increasing the sample size. • What does become normal is the sample proportion (which in this case is 34%) - this is due to the CLT, which we discuss in lecture 12. But only when the sample size is relatively large. 25
  • 27. Lecture 10 (MWF) QQplots Simulating data in JMP Make a new data table. Go to Table > Cols > New Columns.. > (In Column Properties select Formula) > Select Random in the new pop window and the distribution you want to simulate from. In the window above I chose Normal with mean 64.5 and standard deviation 2.5. This means that the number will draw numbers close to 64.5 with spread 2.5. 26
  • 28. Lecture 10 (MWF) QQplots Transforming Data • If the data is far from normal we often do a transformation of it to make it have less outliers and less skewed. Standard transforms for positive data {Xi} • The log transform; Yi = log Xi. The variance of the transformed observation tends to be less than the variance of the original observation (sometimes this transformation is called ‘variance stablisation’). Often used when the sample mean and sample variance of X are close to each other. 27
  • 29. Lecture 10 (MWF) QQplots • The power transform; Yi = Xβ i β 6= 0. This transformation tends to control outliers and ‘unskews’ the data. • The Box-Cox transform Xi; Yi = Xλ i − 1 λ λ 6= 0. 28
  • 30. Lecture 10 (MWF) QQplots Power transformation: Illustration • Left is a QQplot of the original data and the right is the QQplot of the square root of the data (i.e. Xi → √ Xi = X 1/2 i ). • Observe how the square root of the data is still skewed - but it is less skewed than the original data. Reducing skewness in data is very useful way of making the CLT ‘work’ for smaller sample sizes (see later). 29
  • 31. Lecture 10 (MWF) QQplots Transforming Baseball salaries The baseball salary data found here (second to last column) is extremely right skewed. https://www.stat.tamu.edu/~suhasini/teaching651/MLBSalaries. csv By taking the log transform we reduce the skew, but not completely. The left hand plot is the histogram and QQplot of the original data. The right hand plot is the histogram and QQplot of the log-transformed data. 30