Descriptive Statistics and Data Visualization


Outlines the basics of descriptive statistics

Published in: Education

  1. Diversity in Datasets: (d)econstructing Descriptive Statistics and Data Visualization
     Douglas James Joubert, National Institutes of Health Library
  2. Outline
     • Types of Scale
     • Levels of Measurement
     • Descriptive vs. Inferential Statistics
     • Univariate Analysis
     • Graphical Methods for Displaying Data
  3. Before you Survey
     • Consult with a statistician
     • Vital to your success
     • Great way to collaborate
  4. Analysis Always Follows Design
     Question → Hypothesis → Experimental Design → Samples → Data Analysis
     Johnson (2005)
  5. Descriptive Statistics
     • Location: Mean, Mode, Median
     • Spread (Dispersion): SD, Variance, COV
     • Shape of the Distribution: Skewness (+ or -), Kurtosis
  6. Levels of Measurement
     • The questions you ask are just as important as what is being measured
         • Consult, confer, and pick apart your hypothesis
     • Results are only as good as your poorest measurement
         • Your measurement will never provide the absolute truth
     • Try to control as much as possible to reduce error
         • Random error: due to chance, in either direction
         • Systematic error: due to bias, in one direction
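The difference between random and systematic error can be sketched with a small simulation. Everything numeric here is a made-up assumption for illustration (the true value of 100, the bias of 2, the noise level, the sample size); the point is only that chance error averages out over many measurements while bias does not.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is repeatable

TRUE_VALUE = 100.0  # hypothetical "true" quantity being measured
BIAS = 2.0          # hypothetical one-directional offset (e.g., a miscalibrated scale)
N = 10_000          # number of simulated measurements

# Random error: chance deviations in either direction, centered on zero.
random_only = [TRUE_VALUE + random.gauss(0, 1) for _ in range(N)]

# Systematic error: the same kind of chance deviation plus a constant bias.
biased = [TRUE_VALUE + BIAS + random.gauss(0, 1) for _ in range(N)]

random_error = mean(random_only) - TRUE_VALUE   # close to 0: averages out
systematic_error = mean(biased) - TRUE_VALUE    # close to BIAS: does not average out

print(f"mean error, random only: {random_error:+.3f}")
print(f"mean error, with bias:   {systematic_error:+.3f}")
```

Taking more measurements shrinks the first number toward zero but leaves the second near the bias, which is why systematic error has to be addressed in the design rather than by sample size.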
  7. Reducing Measurement Error
     • Triangulate: use different measures (X1, X2) for the same construct
  8. Types of Scale
     • Nominal or Categorical
         • Mutually exclusive groups: gender, sick vs. healthy, remote user vs. library user
         • Used for identification purposes only
         • Cannot be ranked from smallest to largest
     • Ordinal
         • Mutually exclusive groups that are also ordered in a meaningful manner
         • Distance between categories is unknown: you cannot say that a person with a job satisfaction of 2 is twice as satisfied as a person rated as a 1
  9. Types of Scale
     • Interval
         • Ordered groups with equal intervals between any two adjacent classes
         • No absolute zero, so ratios are not meaningful; for example, temperature in °C or °F
     • Ratio
         • Interval scale with a true absolute zero; for example, weight
         • You can tell how much larger or smaller one value is compared with another
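A quick way to see why ratios fail on an interval scale but hold on a ratio scale is to change units and check whether the ratio survives. This is a minimal sketch; the specific temperatures and weights are arbitrary illustrative values, and the helper names are not from the slides.

```python
def celsius_to_fahrenheit(c):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32


def kg_to_lb(kg):
    """Convert a weight from kilograms to pounds."""
    return kg * 2.20462


# Interval scale: the "ratio" of two temperatures depends on the unit chosen,
# because neither scale has a true zero.
ratio_c = 20 / 10                                                  # 2.0
ratio_f = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)    # 68 / 50

# Ratio scale: the ratio of two weights is the same in any unit.
ratio_kg = 20 / 10
ratio_lb = kg_to_lb(20) / kg_to_lb(10)

print(f"temperature ratio: {ratio_c} in Celsius, {ratio_f} in Fahrenheit")
print(f"weight ratio:      {ratio_kg} in kg, {ratio_lb} in lb")
```

Because 20 °C is not "twice as hot" once the unit changes, statements about ratios only make sense for ratio-scale variables such as weight.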
  10. Hierarchy of Measurement (Trochim, 2001)
     • Ratio: absolute zero
     • Interval: distance is meaningful
     • Ordinal: characteristics can be ordered
     • Nominal: classification is arbitrary
  11. Descriptive vs. Inferential Statistics
     • Descriptive (summary) statistics describe or characterize data in such a way that none of the original information is lost or distorted (Munro, 2002)
     • Inferential statistics allow one to draw conclusions about a population based on data obtained from a sample
     [Figure: a sample (S1 to S6) drawn from a population whose values are unknown (?)]
  12. Univariate Descriptive Analysis
     • Allows one to examine each variable separately to check for data inconsistencies and to see how much each variable varies
     • Also allows one to check statistical assumptions about the shape of the distribution before moving on to more complex analysis
     • Univariate descriptive statistics can also be used to determine central tendency, variability, skewness, and kurtosis
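As a rough illustration of these univariate summaries, the sketch below uses the birth weight values exactly as they are printed on slide 16. The skewness and kurtosis lines use simple moment-based formulas, which is one of several conventions; statistical packages often apply small-sample corrections and may report slightly different values.

```python
from statistics import mean, median, multimode, stdev, variance

# Birth weights in ounces, as printed on slide 16.
birth_weights = [161, 155, 103, 103, 101, 100, 98, 98, 89, 94,
                 94, 93, 91, 88, 88, 67, 64, 64, 58, 32]

n = len(birth_weights)

# Location
avg = mean(birth_weights)
med = median(birth_weights)
modes = multimode(birth_weights)   # every value tied for most frequent

# Spread (sample versions, n - 1 in the denominator)
sd = stdev(birth_weights)
var = variance(birth_weights)
cov = sd / avg                     # coefficient of variation

# Shape: moment-based skewness and excess kurtosis (no small-sample correction)
m2 = sum((x - avg) ** 2 for x in birth_weights) / n
m3 = sum((x - avg) ** 3 for x in birth_weights) / n
m4 = sum((x - avg) ** 4 for x in birth_weights) / n
skewness = m3 / m2 ** 1.5
excess_kurtosis = m4 / m2 ** 2 - 3

print(f"mean={avg:.2f} median={med} modes={modes}")
print(f"sd={sd:.2f} variance={var:.2f} cov={cov:.3f}")
print(f"skewness={skewness:.3f} excess kurtosis={excess_kurtosis:.3f}")
```

Several values tie for the mode in this small data set, which is itself a useful thing to notice before relying on the mode as a summary.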
  13. Graphical Methods for Displaying Data
     • Frequency Distributions
     • Histograms
     • Plots
     • Pareto Charts
     • Boxplots
     • Error Bar Charts
  14. Frequency distributions
     • Frequency distributions are a useful tool for categorizing data into meaningful groups
     • They organize data in tabular form using classes and their frequencies
     • Two main types:
         • Categorical: qualitative data such as gender, treatment group or not, religious affiliation
         • Ungrouped or grouped quantitative data
  15. Categorical Frequency distributions
     Raw data: A O B A AB AB A A B B O O O A B AB

     Class    Frequency (f)
     A        5
     B        4
     O        4
     AB       3
     Total    16
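The categorical table on slide 15 can be reproduced by tallying the raw values. This is a minimal sketch using Python's standard `collections.Counter`; the variable names are illustrative.

```python
from collections import Counter

# Raw blood-type observations as listed on slide 15.
blood_types = "A O B A AB AB A A B B O O O A B AB".split()

freq = Counter(blood_types)     # class -> frequency
total = sum(freq.values())

print(f"{'Class':<6}{'Frequency':>10}")
for cls in ("A", "B", "O", "AB"):
    print(f"{cls:<6}{freq[cls]:>10}")
print(f"{'Total':<6}{total:>10}")
```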
  16. Ungrouped Frequency distributions
     Birth weight data (oz):
     161 155 103 103 101 100 98 98 89 94 94 93 91 88 88 67 64 64 58 32
  17. Ungrouped Frequency distributions
     Birth weight    Count (Frequency f)
     32              1
     58              1
     64              2
     67              1
     88              2
     91              1
     93              1
     …
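An ungrouped frequency distribution simply counts how often each distinct value occurs. The sketch below tallies the birth weights as they are printed on slide 16. Note that the printed listing includes the value 89, which the table on slide 17 does not show, so the output here has one more row than the slide.

```python
from collections import Counter

# Birth weights in ounces, as printed on slide 16.
birth_weights = [161, 155, 103, 103, 101, 100, 98, 98, 89, 94,
                 94, 93, 91, 88, 88, 67, 64, 64, 58, 32]

freq = Counter(birth_weights)
table = sorted(freq.items())    # (birth weight, count), smallest weight first

print(f"{'Birth weight':<14}{'Count':>5}")
for weight, count in table:
    print(f"{weight:<14}{count:>5}")
```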
  18. Grouped Frequency Distribution
     • A grouped frequency distribution is obtained by constructing classes (intervals) for the data
     • If the difference between the minimum and maximum values exceeds 15, divide the data into classes
     • Use a minimum of 5 classes and a maximum of 20
     • A histogram is a graphical representation of a frequency distribution
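The grouping rules above can be sketched as follows, again using the birth weights printed on slide 16. The class width of 20 oz and the half-open class convention (lower limit included, upper limit excluded) are assumptions made for this example, not choices stated in the slides.

```python
import math

# Birth weights in ounces, as printed on slide 16.
birth_weights = [161, 155, 103, 103, 101, 100, 98, 98, 89, 94,
                 94, 93, 91, 88, 88, 67, 64, 64, 58, 32]


def grouped_frequencies(values, width):
    """Count values in half-open classes [lower, lower + width)."""
    start = math.floor(min(values) / width) * width
    n_classes = math.floor((max(values) - start) / width) + 1
    counts = [0] * n_classes
    for v in values:
        counts[int((v - start) // width)] += 1
    return [(start + i * width, start + (i + 1) * width, c)
            for i, c in enumerate(counts)]


value_range = max(birth_weights) - min(birth_weights)   # well above 15, so group
classes = grouped_frequencies(birth_weights, width=20)

print(f"range = {value_range}, number of classes = {len(classes)}")
for lower, upper, count in classes:
    print(f"{lower:>3} to <{upper:<4}{count:>3}")
```

A width of 20 gives eight classes here, inside the suggested 5 to 20; a much smaller or larger width would push the count outside that window.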
  19. Grouped Frequency Distribution
     • Typically, grouped frequency distributions will contain:
         • Frequency: the number of values within each category
         • Relative frequency: the percentage of values within each category based on the total number of cases
         • Valid percent: the percentage of cases in each category based on non-missing scores
         • Cumulative frequency: sum of the frequencies for all values at or below the given value
         • Cumulative relative frequency: sum of the relative frequencies for all values at or below the given value
  20. Grouped Frequency Distribution of CA patients
     Age        Frequency    rf*      cf     crf
     0 – 10         2        .0069      2    .0069
     10 – 20       71        .2473     73    .2542
     20 – 30       59        .2055    132    .4597
     30 – 40       70        .2439    202    .7036
     40 – 50       43        .1498    245    .8534
     More          42        .1463    287    .9997
     Total        287        1.00

     * =(E2/$E$8)*100 in Excel; the $ signs force an absolute reference
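The relative, cumulative, and cumulative relative columns follow directly from the class frequencies. The sketch below recomputes them from the frequencies on slide 20 rather than copying the printed columns, so small differences from the printed figures are possible.

```python
from itertools import accumulate

# Age classes and frequencies from slide 20.
ages = ["0 - 10", "10 - 20", "20 - 30", "30 - 40", "40 - 50", "More"]
frequencies = [2, 71, 59, 70, 43, 42]

total = sum(frequencies)
rf = [f / total for f in frequencies]    # relative frequency
cf = list(accumulate(frequencies))       # cumulative frequency
crf = [c / total for c in cf]            # cumulative relative frequency

print(f"{'Age':<10}{'f':>5}{'rf':>8}{'cf':>6}{'crf':>8}")
for age, f, r, c, cr in zip(ages, frequencies, rf, cf, crf):
    print(f"{age:<10}{f:>5}{r:>8.4f}{c:>6}{cr:>8.4f}")
print(f"{'Total':<10}{total:>5}{sum(rf):>8.4f}")
```

Dividing the cumulative frequency by the total, instead of summing rounded relative frequencies, keeps the last cumulative relative frequency at exactly 1.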
  21. Table Tips
     • Use tables to highlight major facts
     • Keep it simple: tables are usually intended to demystify your data, not make it more difficult to understand
     • If you are using a software program to create class intervals, make sure the default works with your data
     • Think of your audience: how can I convey my message without losing important data?
  22. Table Tips
     • The clustering that best describes the data should be the ultimate guide
     • Too few or too many class intervals will obscure important information about your data
     • Tables used to analyze data are rarely published
  23. Charts
     • Effective way to give the reader a snapshot of the differences and patterns in a set of data
     • The primary disadvantage of charts is that you lose the details
     • Things to consider when constructing charts (Munro, 2002):
         • Does my data represent a single moment in time (cross-sectional) or does it occur over time (time series)?
         • Do I have qualitative or quantitative variables?
         • If my variable is quantitative, is it discrete or continuous?
  24. Bar Charts
     • For nominal or ordinal data, use simple bar charts
         • In simple bar charts, there are spaces between categories
     • Cluster bar charts can be used to represent univariate distributions
     • Cluster bar charts can also be stacked
  25. Simple Bar Chart (nominal data)
  26. Stacked Bar Chart
     • You are really just stacking two or more columns into a single new column
     • Compares the percentage that each group contributes to the total across categories
     • Use 100% stacked columns so that you can compare the percentages in each group
  27. Stacked Bar Chart
  28. Histograms
     • Best for interval and ratio data
     • Represent percentages rather than counts
     • Each histogram has a total area of 100%
     • Because the bars cover a continuous range of values, there are no gaps between them
     • From a descriptive standpoint, a histogram allows one to look at the distribution of a variable
     • Consider grouping the data if the range > 15
     • Height of the vertical axis is important
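The "total area of 100%" property can be checked numerically. This sketch bins the birth weights printed on slide 16 into classes of width 20 starting at 20 oz (both are assumptions for this example), converts counts to percentages, and treats bar height as percent per ounce so that height times width gives each bar's area. The text bars are only a rough stand-in for a real chart.

```python
# Birth weights in ounces, as printed on slide 16.
birth_weights = [161, 155, 103, 103, 101, 100, 98, 98, 89, 94,
                 94, 93, 91, 88, 88, 67, 64, 64, 58, 32]

WIDTH = 20   # assumed class width in ounces
START = 20   # assumed lower limit of the first class

# Count how many values fall in each class [lower, lower + WIDTH).
counts = {}
for w in birth_weights:
    lower = START + ((w - START) // WIDTH) * WIDTH
    counts[lower] = counts.get(lower, 0) + 1

n = len(birth_weights)
percent = {lower: 100 * c / n for lower, c in sorted(counts.items())}

# Bar height in percent per ounce; bar area = height * class width.
height = {lower: p / WIDTH for lower, p in percent.items()}
total_area = sum(h * WIDTH for h in height.values())

for lower, p in percent.items():
    print(f"{lower:>3}-{lower + WIDTH:<4}{'#' * round(p / 5):<10}{p:.0f}%")
print(f"total area = {total_area:.1f}%")
```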
  29. Histogram of Family Terms
  30. Histogram: Std Err Bars, Normal Dist Fit
  31. Histogram: SEM and Normal Distributions
     • The standard error of the mean (SEM) estimates how much the sample mean would be expected to vary across repeated samples from the same population
     • Fit distribution (Normal) estimates the parameters of the normal distribution based on the analysis sample
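As a small worked example, the SEM is commonly computed as the sample standard deviation divided by the square root of the sample size. The sketch below applies that formula to the birth weights printed on slide 16; it is an illustration of the usual formula, not a reproduction of the software output shown on the slide.

```python
import math
from statistics import mean, stdev

# Birth weights in ounces, as printed on slide 16.
birth_weights = [161, 155, 103, 103, 101, 100, 98, 98, 89, 94,
                 94, 93, 91, 88, 88, 67, 64, 64, 58, 32]

n = len(birth_weights)
avg = mean(birth_weights)
sd = stdev(birth_weights)     # sample standard deviation (n - 1 denominator)
sem = sd / math.sqrt(n)       # standard error of the mean

print(f"n = {n}, mean = {avg:.2f} oz, SD = {sd:.2f} oz, SEM = {sem:.2f} oz")
print(f"mean ± 1 SEM: {avg - sem:.2f} to {avg + sem:.2f} oz")
```

The SEM shrinks as the sample grows, whereas the SD describes the spread of the individual values and does not.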
  32. Pareto Charts
     • A Pareto chart is a special type of histogram in which the bars are arranged from largest to smallest
     • Allows one to determine which values are least important and which are most important
     • A Pareto chart combines a bar chart displaying percentages of categories in the data with a line plot showing cumulative percentages of the categories
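The numbers behind a Pareto chart are just the category percentages sorted from largest to smallest, plus a running total. To avoid inventing data, this sketch reuses the blood-type counts from slide 15 rather than the SAS example on the following slides; breaking ties alphabetically is an arbitrary choice made here.

```python
# Category counts taken from the blood-type table on slide 15.
counts = {"A": 5, "B": 4, "O": 4, "AB": 3}

total = sum(counts.values())

# Largest category first; ties broken alphabetically (arbitrary choice).
ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

pareto = []     # rows of (category, percent, cumulative percent)
running = 0
for label, count in ordered:
    running += count
    pareto.append((label, 100 * count / total, 100 * running / total))

print(f"{'Category':<10}{'%':>8}{'Cum. %':>9}")
for label, pct, cum in pareto:
    print(f"{label:<10}{pct:>8.2f}{cum:>9.2f}")
```

The percent column would be drawn as the bars and the cumulative column as the overlaid line.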
  33. Pareto Chart (SAS, 1990)
  34. 2-Way Comparative Pareto Chart (SAS, 1990)
  35. Overlay Chart (SAS, 1989–2004)
     • Similar to a scatterplot, but you are only looking at one variable
  37. Plots
     • Scatterplots look at the relationship between two or more variables
     • Great way to identify outliers
     • Typically the Y-axis is the dependent variable (DV) and the X-axis is the independent variable (IV)
     • Using a control variable allows one to identify different groups
     • For example, the relationship between blood pressure (bp) and weight, controlling for smoking vs. non-smoking
     • Why? Because we are controlling for some factor
  38. Simple Scatterplot (SAS, 1989–2004)
  39. Simple Scatterplot (SAS, 1989–2004)
     • In correlation, this is the least-squares line (scary math, but very important)
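The math behind the least-squares line is less scary than it looks: the slope is the sum of cross-deviations divided by the sum of squared deviations in x, and the line passes through the point of means. The (x, y) pairs below are hypothetical values picked so the result is easy to check by hand; they are not the data plotted on the slide.

```python
from statistics import mean

# Hypothetical data: x is the independent variable, y the dependent variable.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar = mean(x)
y_bar = mean(y)

# Sum of cross-deviations and sum of squared deviations in x.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

slope = s_xy / s_xx
intercept = y_bar - slope * x_bar    # the line passes through (x_bar, y_bar)

# Residuals: vertical distances from each point to the fitted line.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(f"least-squares line: y = {intercept:.2f} + {slope:.2f} * x")
print("residuals:", [round(r, 2) for r in residuals])
```

Points with unusually large residuals are the ones a scatterplot makes visible as potential outliers.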
  40. Box-and-Whisker Plots
     • A graphical method based on percentiles
     • Useful for visualizing the distribution of a variable
     • Simultaneously displays the median, the IQR, and the smallest and largest values for a group
     • More compact than a histogram but less revealing
     • Good tool for identifying outliers and extreme values
     • Two common types: Outlier Box Plot and Quantile Box Plot
  41. Outlier Box Plot
     [Figure labels: possible outliers; largest value that is not an outlier; 75th percentile; 50th percentile (median); 25th percentile; smallest value that is not an outlier; IQR]
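The quantities labeled on an outlier box plot can be computed directly. The slide does not say how "possible outliers" are defined, so this sketch assumes the common convention of flagging values more than 1.5 × IQR beyond the quartiles. It uses the birth weights printed on slide 16 and the default quartile method of Python's `statistics.quantiles`; other software may use a slightly different quartile rule and give slightly different cut points.

```python
from statistics import quantiles

# Birth weights in ounces, as printed on slide 16.
birth_weights = [161, 155, 103, 103, 101, 100, 98, 98, 89, 94,
                 94, 93, 91, 88, 88, 67, 64, 64, 58, 32]

# 25th, 50th (median), and 75th percentiles.
q1, q2, q3 = quantiles(birth_weights, n=4)
iqr = q3 - q1

# Assumed convention: fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = sorted(w for w in birth_weights
                  if w < lower_fence or w > upper_fence)
inside = [w for w in birth_weights if lower_fence <= w <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print(f"possible outliers: {outliers}")
```

The whiskers end at the smallest and largest values that are not outliers, which matches the labels on the slide's figure.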
  42. Quantile Box Plot
  43. Contact Information
     Douglas J. Joubert, MLIS
     Biomedical Informationist
     National Institutes of Health Library
     Bldg. 10, Room 1L09A
     Bethesda, MD 20906-1150
     Phone: 301.594.6282
     Fax: 301.402.0254
  44. References
     • Johnson, L. L. (2004). Principles and Practices of Clinical Research (Lecture). NIH.
     • SAS (1990). Common causes of failure during the fabrication of integrated circuits. Data from "Selected SAS/QC Software Examples, Release 6.06," SAS Users Group International Conference, April 2, 1990, p. 383.
     • Munro, B. H. (2001). Statistical methods for health care research (4th ed.). Philadelphia: Lippincott Williams & Wilkins.
     • SAS Institute Inc. (1989-2004). SAS Help Files. Cary, North Carolina.