Successfully reported this slideshow.
Upcoming SlideShare
×

of

Upcoming SlideShare
How to run ANOVA on SPSS
Next

10 Likes

Share

# Exploratory Data Analysis - Checking For Normality

Exploratory Data Analysis - Checking For Normality

See all

See all

### Exploratory Data Analysis - Checking For Normality

1. 1. FK6163 Explore & Summarise Dr Azmi Mohd Tamil Dept of Community Health Universiti Kebangsaan Malaysia ©drtamil@gmail.com 2012
2. 2. Introduction Method of Exploring and Summarising Data differs According to Types of Variables ©drtamil@gmail.com 2012
3. 3. Dependent/Independent Independent Variables Food Intake Frequency of Exercise Obesity Dependent Variable ©drtamil@gmail.com 2012
5. 5. Explore 4 Itis the first step in the analytic process 4 to explore the characteristics of the data 4 to screen for errors and correct them 4 to look for distribution patterns - normal distribution or not 4 May require transformation before further analysis using parametric methods 4 Or may need analysis using non-parametric techniques ©drtamil@gmail.com 2012
6. 6. Data Screening PARITY Frequency Percent 4 By running Valid 1 67 30.7 frequencies, we may 2 44 20.2 3 36 16.5 detect inappropriate 4 22 10.1 responses 5 21 9.6 6 8 3.7 4 How many in the 7 3 1.4 audience have 15 8 7 3.2 children and 9 5 2.3 10 3 1.4 currently pregnant 11 1 .5 with the 16th? 15 1 .5 Total 218 100.0 ©drtamil@gmail.com 2012
7. 7. Data Screening 4 See whether the data make sense or not. 4 E.g. Parity 10 but age only 25. ©drtamil@gmail.com 2012
10. 10. Data Screening 4 By looking at measures of central tendency and range, we can also detect abnormal values for quantitative data Descriptive Statistics Std. N Minimum Maximum Mean Deviation Pre-pregnancy weight 184 32 484 53.05 33.37 Valid N (listwise) 184 ©drtamil@gmail.com 2012
11. 11. Interpreting the Box Plot Outlier Largest non-outlier The whiskers extend to 1.5 times the box width from both ends Upper quartile of the box and ends at an observed value. Three times the box Median width marks the boundary between "mild" and "extreme" Lower quartile outliers. "mild" = closed dots Smallest non-outlier Outlier"extreme"= open dots ©drtamil@gmail.com 2012
12. 12. Data Screening 600 4 We can also make 500 73 use of 400 graphical tools such 300 as the box 200 plot to detect 100 181 211 198 141 wrong 0 data entry N= 184 Pre-pregnancy weight ©drtamil@gmail.com 2012
13. 13. Data Cleaning 4 Identify the extreme/wrong values 4 Check with original data source – i.e. questionnaire 4 If incorrect, do the necessary correction. 4 Correction must be done before transformation, recoding and analysis. ©drtamil@gmail.com 2012
14. 14. Parameters of Data Distribution 4 Mean – central value of data 4 Standard deviation – measure of how the data scatter around the mean 4 Symmetry (skewness) – the degree of the data pile up on one side of the mean 4 Kurtosis – how far data scatter from the mean ©drtamil@gmail.com 2012
15. 15. Normal distribution 4 The Normal distribution is represented by a family of curves defined uniquely by two parameters, which are the mean and the standard deviation of the population. 4 The curves are always symmetrically bell shaped, but the extent to which the bell is compressed or flattened out depends on the standard deviation of the population. 4 However, the mere fact that a curve is bell shaped does not mean that it represents a Normal distribution, because other distributions may have a similar sort of shape. ©drtamil@gmail.com 2012
16. 16. Normal distribution 4 If the observations follow a 99.7% Normal distribution, a range 95.4% covered by one standard 68.3% deviation above the mean and one standard deviation below it includes about 68.3% of the observations; 4 a range of two standard deviations above and two below (+ 2sd) about 95.4% of the observations; and 4 of three standard deviations above and three below (+ 3sd) about 99.7% of the observations ©drtamil@gmail.com 2012
17. 17. Normality 4 Why bother with normality?? 4 Because it dictates the type of analysis that you can run on the data ©drtamil@gmail.com 2012
18. 18. Normality-Why? Parametric Qualitative Quantitative Normally distributed data Student's t Test Dichotomus Qualitative Quantitative Normally distributed data ANOVA Polinomial Quantitative Quantitative Repeated measurement of the Paired t Test same individual & item (e.g. Hb level before & after treatment). Normally distributed data Quantitative - Quantitative - Normally distributed data Pearson Correlation continous continous & Linear Regresssion ©drtamil@gmail.com 2012
19. 19. Normality-Why? Non-parametric Qualitative Quantitative Data not normally distributed Wilcoxon Rank Sum Dichotomus Test or U Mann- Whitney Test Qualitative Quantitative Data not normally distributed Kruskal-Wallis One Polinomial Way ANOVA Test Quantitative Quantitative Repeated measurement of the Wilcoxon Rank Sign same individual & item Test Quantitative - Quantitative - Data not normally distributed Spearman/Kendall continous/ordina continous Rank Correlation l ©drtamil@gmail.com 2012
20. 20. Normality-How? 4 Explored statistically 4 Explored graphically • Kolmogorov-Smirnov • Histogram statistic, with • Stem & Leaf Lilliefors significance • Box plot level and the • Normal probability Shapiro-Wilks plot statistic • Detrended normal • Skew ness (0) plot • Kurtosis (0) – + leptokurtic – 0 mesokurtik – - platykurtic ©drtamil@gmail.com 2012
21. 21. Kolmogorov- Smirnov 4 In the 1930’s, Andrei Nikolaevich Kolmogorov (1903-1987) and N.V. Smirnov (his student) came out with the approach for comparison of distributions that did not make use of parameters. 4 This is known as the Kolmogorov- Smirnov test. ©drtamil@gmail.com 2012
22. 22. Skew ness 4 Skewed to the right indicates the presence of large extreme values 4 Skewed to the left indicates the presence of small extreme values ©drtamil@gmail.com 2012
23. 23. Kurtosis 4 For symmetrical distribution only. 4 Describes the shape of the curve 4 Mesokurtic - average shaped 4 Leptokurtic - narrow & slim 4 Platikurtic - flat & wide ©drtamil@gmail.com 2012
24. 24. Skew ness & Kurtosis 4 Skew ness ranges from -3 to 3. 4 Acceptable range for normality is skew ness lying between -1 to 1. 4 Normality should not be based on skew ness alone; the kurtosis measures the “peak ness” of the bell-curve (see Fig. 4). 4 Likewise, acceptable range for normality is kurtosis lying between -1 to 1. ©drtamil@gmail.com 2012
26. 26. Normality - Examples Graphically 60 50 40 30 20 10 Std. Dev = 5.26 Mean = 151.6 0 N = 218.00 140.0 145.0 150.0 155.0 160.0 165.0 142.5 147.5 152.5 157.5 162.5 167.5 Height ©drtamil@gmail.com 2012
27. 27. Q&Q Plot 4 This plot compares the quintiles of a data distribution with the quintiles of a standardised theoretical distribution from a specified family of distributions (in this case, the normal distribution). 4 If the distributional shapes differ, then the points will plot along a curve instead of a line. 4 Take note that the interest here is the central portion of the line, severe deviations means non-normality. Deviations at the “ends” of the curve signifies the existence of outliers. ©drtamil@gmail.com 2012
28. 28. Normality - Examples Graphically Normal Q-Q Plot of Height 3 2 1 0 Detrended Normal Q-Q Plot of Height Expected Normal -1 .6 .5 -2 .4 -3 .3 130 140 150 160 170 .2 Observed Value Dev from Normal .1 0.0 -.1 -.2 130 140 150 160 170 Observed Value ©drtamil@gmail.com 2012
29. 29. Normal distribution Mean=median=mode ©drtamil@gmail.com 2012
30. 30. Normality - Examples Statistically Descriptives Statistic Std. Error Height Mean 151.65 .356 95% Confidence Lower Bound 150.94 Interval for Mean Upper Bound Normal distribution 152.35 Mean=median=mode 5% Trimmed Mean 151.59 Median 151.50 Variance 27.649 Skewness & kurtosis Std. Deviation 5.258 Minimum 139 within +1 Maximum 168 Range 29 Interquartile Range 8.00 p > 0.05, so normal Skewness .148 .165 distribution Kurtosis .061 .328 Tests of Normality a Kolmogorov-Smirnov Shapiro-Wilks; only if Statistic df Sig. sample size less than 100. Height .060 218 .052 a. Lilliefors Significance Correction ©drtamil@gmail.com 2012
31. 31. K-S Test ©drtamil@gmail.com 2012
32. 32. K-S Test 4 very sensitive to the sample sizes of the data. 4 For small samples (n<20, say), the likelihood of getting p<0.05 is low 4 for large samples (n>100), a slight deviation from normality will result in being reported as abnormal distribution ©drtamil@gmail.com 2012
33. 33. Guide to deciding on normality ©drtamil@gmail.com 2012
34. 34. Normality Transformation Normal Q-Q Plot of PARITY Normal Q-Q Plot of PARITY 33 22 11 Normal Q-Q Plot of LN_PARIT Normal Q-Q Plot of LN_PARIT 00 3 3 Expected Normal Expected Normal -1 -1 2 2 -2 -2 00 22 44 66 88 10 10 12 12 14 14 16 16 Observed Value Observed Value 1 1 0 0 Expected Normal Expected Normal -1 -1 -2 -2 -.5 -.5 0.0 0.0 .5 .5 1.0 1.0 1.5 1.5 2.0 2.0 2.5 2.5 3.0 3.0 Observed Value Observed Value ©drtamil@gmail.com 2012
35. 35. TYPES OF TRANSFORMATIONS Square root Logarithm Inverse Reflect and square Reflect and logarithm Reflect and inverse root ©drtamil@gmail.com 2012
36. 36. Summarise 4 Summarise a large set of data by a few meaningful numbers. 4 Single variable analysis • For the purpose of describing the data • Example; in one year, what kind of cases are treated by the Psychiatric Dept? • Tables & diagrams are usually used to describe the data • For numerical data, measures of central tendency & spread is usually used ©drtamil@gmail.com 2012
37. 37. Frequency Table Race F % Malay 760 95.84% Chinese 5 0.63% Indian 0 0.00% Others 28 3.53% TOTAL 793 100.00% •Illustrates the frequency observed for each category ©drtamil@gmail.com 2012
38. 38. Frequency Distribution Table • > 20 observations, best Umur Bil % presented as a frequency 0-0.99 25 3.26% 1-4.99 78 10.18% distribution table. 5-14.99 140 18.28% •Columns divided into class & 15-24.99 126 16.45% 25-34.99 112 14.62% frequency. 35-44.99 90 11.75% •Mod class can be determined 45-54.99 66 8.62% 55-64.99 60 7.83% using such tables. 65-74.99 50 6.53% 75-84.99 16 2.09% 85+ 3 0.39% JUMLAH 766 ©drtamil@gmail.com 2012
40. 40. Measures of Central Tendency 4Mean 4Mode 4Median ©drtamil@gmail.com 2012
41. 41. Measures of Variability 4Standard deviation 4Inter-quartiles 4Skew ness & kurtosis ©drtamil@gmail.com 2012
42. 42. Mean 4 theaverage of the data collected 4 To calculate the mean, add up the observed values and divide by the number of them. 4A major disadvantage of the mean is that it is sensitive to outlying points ©drtamil@gmail.com 2012
43. 43. Mean: Example 412, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58 4Total of x = 648 4n= 20 4Mean = 648/20 = 32.4 ©drtamil@gmail.com 2012
44. 44. Measures of variation - standard deviation 4 tells us how much all the scores in a dataset cluster around the mean. A large S.D. is indicative of a more varied data scores. 4 a summary measure of the differences of each observation from the mean. 4 If the differences themselves were added up, the positive would exactly balance the negative and so their sum would be zero. 4 Consequently the squares of the differences are added. ©drtamil@gmail.com 2012
46. 46. sd: Example x x 4 12, 13, 17, 21, 24, 24, (x-mean)^2 (x-mean)^2 12 416.16 32 0.16 26, 27, 27, 30, 32, 35, 13 376.36 35 6.76 37, 38, 41, 43, 44, 46, 17 237.16 37 21.16 53, 58 21 129.96 38 31.36 24 70.56 41 73.96 4 Mean = 32.4; n = 20 24 70.56 43 112.36 4 Total of(x-mean)2 26 40.96 44 134.56 = 3050.8 27 29.16 46 184.96 27 29.16 53 424.36 4 Variance = 3050.8/19 30 5.76 58 655.36 = 160.5684 TOTAL 1405.8 TOTAL 1645 4 sd = 160.56840.5=12.67 ©drtamil@gmail.com 2012
47. 47. Median 4 the ranked value that lies in the middle of the data 4 the point which has the property that half the data are greater than it, and half the data are less than it. 4 if n is even, average the n/2th largest and the n/2 + 1th largest observations 4 "robust" to outliers ©drtamil@gmail.com 2012
48. 48. Median: 4 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58 4 (20+1)/2 = 10th which is 30, 11th is 32 4 Therefore median is (30 + 32)/2 = 31 ©drtamil@gmail.com 2012
49. 49. Measures of variation - quartiles 4 The range is very susceptible to what are known as outliers 4A more robust approach is to divide the distribution of the data into four, and find the points below which are 25%, 50% and 75% of the distribution. These are known as quartiles, and the median is the second quartile. ©drtamil@gmail.com 2012
50. 50. Quartiles 4 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58 4 25th percentile 24; (24+24)/2 4 50th percentile 31; (30+32)/2 ; = median 4 75th percentile 42.5; (41+43)/2 ©drtamil@gmail.com 2012
51. 51. Mode 4 The most frequent occurring number. E.g. 3, 13, 13, 20, 22, 25: mode = 13. 4 It is usually more informative to quote the mode accompanied by the percentage of times it happened; e.g., the mode is 13 with 33% of the occurrences. ©drtamil@gmail.com 2012
52. 52. Mode: Example 4 12,13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58 4 Modes are 24 (10%) & 27 (10%) ©drtamil@gmail.com 2012
53. 53. Mean or Median? 4 Which measure of central tendency should we use? 4 if the distribution is normal, the mean+sd will be the measure to be presented, otherwise the median+IQR should be more appropriate. ©drtamil@gmail.com 2012
54. 54. Not Normal distribution; Normal distribution; Use Median & IQR Use Mean+SD ©drtamil@gmail.com 2012
55. 55. Presentation Qualitative & Quantitative Data Charts & Tables ©drtamil@gmail.com 2012
56. 56. Presentation Qualitative Data ©drtamil@gmail.com 2012
57. 57. Graphing Categorical Data: Univariate Data Categorical Data Graphing Data Tabulating Data The Summary Table Pie Charts CD S avings B onds Bar Charts Pareto Diagram S toc ks 45 120 40 0 10 20 30 40 50 100 35 30 80 25 60 20 15 40 10 20 5 0 0 S toc ks B onds S avings CD ©drtamil@gmail.com 2012
58. 58. Bar Chart 80 69 60 40 20 20 Percent 11 0 Housew ife Office w ork Field w ork Type of work ©drtamil@gmail.com 2012
59. 59. Pie Chart Others Chinese Malay ©drtamil@gmail.com 2012
60. 60. Tabulating and Graphing Bivariate Categorical Data 4 Contingency tables: Table 1: Contigency table of pregnancy induced hypertension and SGA Count SGA Normal SGA Total Pregnancy induced No 103 94 197 hypertension Yes 5 16 21 Total 108 110 218 ©drtamil@gmail.com 2012
61. 61. Tabulating and Graphing Bivariate Categorical Data 120 4 Side 100 by 103 94 side 80 charts 60 40 SGA 20 Normal Count 16 0 SGA No Yes Pregnancy induced hypertension ©drtamil@gmail.com 2012
62. 62. Presentation Quantitative Data ©drtamil@gmail.com 2012
63. 63. Tabulating and Graphing Numerical Data Numerical Data 41, 24, 32, 26, 27, 27, 30, 24, 38, 21 Frequency Distributions Ordered Array Ogive 21, 24, 24, 26, 27, 27, 30, 32, 38, 41 Cumulative Distributions 120 100 80 60 40 20 0 2 144677 Area 10 20 30 40 50 60 Stem and Leaf Histograms 3 028 Display 7 6 4 1 5 4 Tables 3 2 1 Polygons 0 10 20 30 40 50 60 ©drtamil@gmail.com 2012
64. 64. Tabulating Numerical Data: Frequency Distributions 4 Sort raw data in ascending order: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58 4 Find range: 58 - 12 = 46 4 Select number of classes: 5 (usually between 5 and 15) 4 Compute class interval (width): 10 (46/5 then round up) 4 Determine class boundaries (limits): 10, 20, 30, 40, 50, 60 4 Compute class midpoints: 14.95, 24.95, 34.95, 44.95, 54.95 4 Count observations & assign to classes ©drtamil@gmail.com 2012
65. 65. Frequency Distributions and Percentage Distributions Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58 Class Midpoint Freq % 10.0 - 19.9 14.95 3 15% 20.0 - 29.9 24.95 6 30% 30.0 - 39.9 34.95 5 25% 40.0 - 49.9 44.95 4 20% 50.0 - 59.9 54.95 2 10% TOTAL 20 100% ©drtamil@gmail.com 2012
66. 66. Graphing Numerical Data: The Histogram Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58 7 6 6 5 5 Frequency 4 4 3 No Gaps 3 Between 2 2 Bars 1 0 14.95 24.95 34.95 44.95 54.95 Age Class Boundaries Class Midpoints ©drtamil@gmail.com 2012
67. 67. Graphing Numerical Data: The Frequency Polygon Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58 7 6 5 4 3 2 1 0 14.95 24.95 34.95 44.95 54.95 Class Midpoints ©drtamil@gmail.com 2012
68. 68. Calculate Measures of Central Tendency & Spread 4 We can use frequency distribution table to calculate; • Mean • Standard Deviation • Median • Mode ©drtamil@gmail.com 2012
69. 69. Mean X= ∑ f .mp n Class Midpoint Freq freq x m.p. 4 Mean = 659/20 10.0 - 19.9 14.95 3 44.85 = 32.95 20.0 - 29.9 24.95 6 149.70 4 Compare with 32.4 30.0 - 39.9 34.95 5 174.75 from direct 40.0 - 49.9 44.95 4 179.80 calculation. 50.0 - 59.9 54.95 2 109.90 TOTAL 20 659.00 ©drtamil@gmail.com 2012
70. 70. Standard deviation 2 ( ∑ f .mp ) ∑ f .mp 2 − n s= Mid n −1 Class Point Freq f.m.p. f.mp^2 14.95 3 44.85 s2=((24634.05-(6592/20))/19) 10.0 - 19.9 670.51 s2=2920.05/19 20.0 - 29.9 24.95 6 149.70 3735.02 s2=153.69 30.0 - 39.9 34.95 5 174.75 6107.51 s = 12.4 40.0 - 49.9 44.95 4 179.80 8082.01 4 Compare with 12.67 from direct measurement. 50.0 - 59.9 54.95 2 109.90 6039.01 TOTAL 20 659.00 24634.05 ©drtamil@gmail.com 2012
71. 71. Median Class Freq 4 L1 +i *((n+1)/2) – f1 fmed 10.0 - 19.9 3 4 f1 = cumulative freq above median class 20.0 - 29.9 6 4 29.95 + 10((21/2)-9) 30.0 - 39.9 5 median class 5 40.0 - 49.9 4 4 29.95 + 15/5 = 32.95 4 From direct calculation, 50.0 - 59.9 2 median = 31 TOTAL 20 ©drtamil@gmail.com 2012
72. 72. Mode =L1 +i *(Diff1/(Diff1+Diff2)) Class Freq =19.95 + 10(3/(3+1)) =27.45 10.0 - 19.9 3 20.0 - 29.9 6 mode class 4 Compare with 30.0 - 39.9 5 modes of 24 & 27 40.0 - 49.9 4 from direct 50.0 - 59.9 2 calculation. TOTAL 20 ©drtamil@gmail.com 2012
73. 73. Graphing Bivariate Numerical Data (Scatter Plot) ©drtamil@gmail.com 2012
74. 74. Linear Regression Line ©drtamil@gmail.com 2012
75. 75. Survival Function 1.2 1.0 .8 .6 .4 C S rvival um u .2 Survival Function 0.0 Censored 0 1 2 3 4 5 6 7 DURATION ©drtamil@gmail.com 2012
76. 76. Principles of Graphical Excellence 4 Presents data in a way that provides substance, statistics and design 4 Communicates complex ideas with clarity, precision and efficiency 4 Gives the largest number of ideas in the most efficient manner 4 Almost always involves several dimensions 4 Tells the truth about the data ©drtamil@gmail.com 2012

Dec. 5, 2021
• #### JoeLianMin

Sep. 25, 2020
• #### nurulhazliana

Sep. 26, 2017

Mar. 24, 2017
• #### KennethLeong1

Aug. 10, 2016
• #### AhmedRefat

Sep. 29, 2015
• #### saenko-av

Oct. 18, 2014
• #### raraeynj

Jun. 13, 2013

Oct. 4, 2012
• #### sitinuridzaan

Sep. 27, 2012

Exploratory Data Analysis - Checking For Normality

Total views

6,099

On Slideshare

0

From embeds

0

Number of embeds

972

333

Shares

0