Looking at data
Upcoming SlideShare
Loading in...5
×
 

Like this? Share it with your network

Share

Looking at data

on

  • 1,070 views

 

Statistics

Views

Total Views
1,070
Views on SlideShare
1,045
Embed Views
25

Actions

Likes
0
Downloads
25
Comments
0

1 Embed 25

http://sciencenglish.pbworks.com 25

Accessibility

Categories

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment
  • That's really what distinguishes these from discrete numerical
  • What are some others?
  • Does everybody know what I mean when I say percentiles? What is the median? Anyone?
  • 1. Bin sizes may be altered. 2. How many people do you think are in bin 125-135? 3. Where do you think the center of the data are (what's your best guess at the average weight)? 4. On average, how far do you think a given woman is from 127 -- the center/mean?
  • Balance the Bell Curve on a point. Where is the point of balance, average mass on each side.
  • 1. Bin sizes may be altered. 2. How many people do you think are in bin 125-135? 3. Where do you think the center of the data are (what's your best guess at the average weight)? 4. On average, how far do you think a given woman is from 127 -- the center/mean?
  • SAY: within 1 standard deviation either way of the mean within 2 standard deviations of the mean within 3 standard deviations either way of the mean WORKS FOR ALL NORMAL CURVES NO MATTER HOW SKINNY OR FAT

Looking at data Presentation Transcript

  • 1. Looking at Data
  • 2. Clinical Data Example
    • 1. Kline et al. (2002)
      • The researchers analyzed data from 934 emergency room patients with suspected pulmonary embolism (PE). Only about 1 in 5 actually had PE. The researchers wanted to know what clinical factors predicted PE.
      • I will use four variables from their dataset today:
        • Pulmonary embolism (yes/no)
        • Age (years)
        • Shock index = heart rate/systolic BP
        • Shock index categories = take shock index and divide it into 10 groups (lowest to highest shock index)
  • 3. Descriptive Statistics
  • 4. Types of Variables: Overview Categorical Quantitative continuous discrete ordinal nominal binary 2 categories + more categories + order matters + numerical + uninterrupted
  • 5. Categorical Variables
    • Also known as “qualitative.”
    • Categories.
        • treatment groups
        • exposure groups
        • disease status
  • 6. Categorical Variables
    • Dichotomous (binary) – two levels
        • Dead/alive
        • Treatment/placebo
        • Disease/no disease
        • Exposed/Unexposed
        • Heads/Tails
        • Pulmonary Embolism (yes/no)
        • Male/female
  • 7. Categorical Variables
    • Nominal variables – Named categories Order doesn’t matter!
        • The blood type of a patient (O, A, B, AB)
        • Marital status
        • Occupation
  • 8. Categorical Variables
    • Ordinal variable – Ordered categories. Order matters!
        • Staging in breast cancer as I, II, III, or IV
        • Birth order—1st, 2nd, 3rd, etc.
        • Letter grades (A, B, C, D, F)
        • Ratings on a scale from 1-5
        • Ratings on: always; usually; many times; once in a while; almost never; never
        • Age in categories (10-20, 20-30, etc.)
        • Shock index categories (Kline et al.)
  • 9. Quantitative Variables
    • Numerical variables; may be arithmetically manipulated.
      • Counts
      • Time
      • Age
      • Height
  • 10. Quantitative Variables
    • Discrete Numbers – a limited set of distinct values, such as whole numbers.
        • Number of new AIDS cases in CA in a year (counts)
        • Years of school completed
        • The number of children in the family (cannot have a half a child!)
        • The number of deaths in a defined time period (cannot have a partial death!)
        • Roll of a die
  • 11. Quantitative Variables
    • Continuous Variables - Can take on any number within a defined range.
        • Time-to-event (survival time)
        • Age
        • Blood pressure
        • Serum insulin
        • Speed of a car
        • Income
        • Shock index (Kline et al.)
  • 12. Looking at Data
    •  How are the data distributed?
      • Where is the center?
      • What is the range?
      • What’s the shape of the distribution (e.g., Gaussian, binomial, exponential, skewed)?
    •  Are there “outliers”?
    •  Are there data points that don’t make sense?
  • 13. The first rule of statistics: USE COMMON SENSE! 90% of the information is contained in the graph.
  • 14. Frequency Plots (univariate)
      • Categorical variables
      • Bar Chart
      • Continuous variables
      • Box Plot
      • Histogram
  • 15. Bar Chart
    • Used for categorical variables to show frequency or proportion in each category.
    • Translate the data from frequency tables into a pictorial representation…
  • 16. Bar Chart: categorical variables no yes
  • 17. Much easier to extract information from a bar chart than from a table! Bar Chart for SI categories Number of Patients Shock Index Category 0.0 16.7 33.3 50.0 66.7 83.3 100.0 116.7 133.3 150.0 166.7 183.3 200.0 1 2 3 4 5 6 7 8 9 10
  • 18. Box plot and histograms: for continuous variables
    • To show the distribution (shape, center, range, variation) of continuous variables.
  • 19. 0.0 0.7 1.3 2.0 SI Box Plot: Shock Index Shock Index Units “ whisker” Q3 + 1.5IQR = .8+1.5(.25)=1.175 75th percentile (0.8) 25th percentile (0.55) maximum (1.7) interquartile range (IQR) = .8-.55 = .25 minimum (or Q1-1.5IQR) Outliers median (.66)
  • 20. Note the “right skew” Bins of size 0.1 0.0 8.3 16.7 25.0 0.0 0.7 1.3 2.0 Histogram of SI SI Percent
  • 21. 100 bins (too much detail)
  • 22. 2 bins (too little detail)
  • 23. Also shows the “right skew” 0.0 0.7 1.3 2.0 SI Box Plot: Shock Index Shock Index Units
  • 24. 0.0 33.3 66.7 100.0 AGE Box Plot: Age Variables Years More symmetric 75th percentile 25th percentile maximum interquartile range minimum median
  • 25. Histogram: Age Not skewed, but not bell-shaped either… 0.0 4.7 9.3 14.0 0.0 33.3 66.7 100.0 AGE (Years) Percent
  • 26. Some histograms from last year’s class (n=18) Starting with politics…
  • 27.  
  • 28.  
  • 29. Feelings about math and writing…
  • 30. Optimism…
  • 31. Measures of central tendency
    • Mean
    • Median
    • Mode
  • 32. Central Tendency
    • Mean – the average; the balancing point
      • calculation: the sum of values divided by the sample size
    In math shorthand:
  • 33. Mean: example
      • Some data :
      • Age of participants: 17 19 21 22 23 23 23 38
  • 34. Mean of age in Kline’s data
        • Means Section of AGE
        • Geometric Harmonic
        • Parameter Mean Median Mean Mean Sum Mode
        • Value 50.19334 49 46.66865 43.00606 46730 49
        • 556.9546
    0.0 4.7 9.3 14.0 0.0 33.3 66.7 100.0 Percent
  • 35. Mean of age in Kline’s data The balancing point 0.0 4.7 9.3 14.0 0.0 33.3 66.7 100.0 Percent
  • 36. Mean of Pulmonary Embolism? (Binary variable?) 19.44% (181) 80.56% (750)
  • 37. Mean
    • The mean is affected by extreme values (outliers)
    0 1 2 3 4 5 6 7 8 9 10 Mean = 3 0 1 2 3 4 5 6 7 8 9 10 Mean = 4
    • Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
  • 38. Central Tendency
    • Median – the exact middle value
    • Calculation:
    • If there are an odd number of observations, find the middle value
    • If there are an even number of observations, find the middle two values and average them.
  • 39. Median: example
      • Some data :
      • Age of participants: 17 19 21 22 23 23 23 38
    Median = (22+23)/2 = 22.5
  • 40. Median of age in Kline’s data
        • Means Section of AGE
        • Geometric Harmonic
        • Parameter Mean Median Mean Mean Sum Mode
        • Value 50.19334 49 46.66865 43.00606 46730 49
    0.0 4.7 9.3 14.0 0.0 33.3 66.7 100.0 AGE (Years) Percent
  • 41. Median of age in Kline’s data 0.0 4.7 9.3 14.0 0.0 33.3 66.7 100.0 Percent 50% of mass 50% of mass
  • 42. Does PE have a median?
    • Yes, if you line up the 0’s and 1’s, the middle number is 0.
  • 43. Median
    • The median is not affected by extreme values (outliers).
    0 1 2 3 4 5 6 7 8 9 10 Median = 3 0 1 2 3 4 5 6 7 8 9 10 Median = 3
            • SSlide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
  • 44. Central Tendency
    • Mode – the value that occurs most frequently
  • 45. Mode: example
      • Some data :
      • Age of participants: 17 19 21 22 23 23 23 38
    Mode = 23 (occurs 3 times)
  • 46. Mode of age in Kline’s data
        • Means Section of AGE
        • Geometric Harmonic
        • Parameter Mean Median Mean Mean Sum Mode
        • Value 50.19334 49 46.66865 43.00606 46730 49
  • 47. Mode of PE?
    • 0 appears more than 1, so 0 is the mode.
  • 48. Measures of Variation/Dispersion
    • Range
    • Percentiles/quartiles
    • Interquartile range
    • Standard deviation/Variance
  • 49. Range
    • Difference between the largest and the smallest observations.
  • 50. 0.0 4.7 9.3 14.0 0.0 33.3 66.7 100.0 Range of age: 94 years-15 years = 79 years AGE (Years) Percent
  • 51. Range of PE?
    • 1-0 = 1
  • 52. Quartiles 25% 25% 25% 25%
    • The first quartile, Q 1 , is the value for which 25% of the observations are smaller and 75% are larger
    • Q 2 is the same as the median (50% are smaller, 50% are larger)
    • Only 25% of the observations are greater than the third quartile
    Q1 Q2 Q3
  • 53. Interquartile Range
    • Interquartile range = 3 rd quartile – 1 st quartile = Q 3 – Q 1
  • 54. Interquartile Range: age Median (Q2) maximum minimum Q1 Q3 25% 25% 25% 25% 15 35 49 65 94 Interquartile range = 65 – 35 = 30
  • 55.
    • Average (roughly) of squared deviations of values from the mean
    Variance
  • 56. Why squared deviations?
    • Adding deviations will yield a sum of 0.
    • Absolute values are tricky!
    • Squares eliminate the negatives.
    • Result:
      • Increasing contribution to the variance as you go farther from the mean.
  • 57. Standard Deviation
    • Most commonly used measure of variation
    • Shows variation about the mean
    • Has the same units as the original data
  • 58. Calculation Example: Sample Standard Deviation Age data (n=8) : 17 19 21 22 23 23 23 38 n = 8 Mean = X = 23.25
  • 59. 0.0 4.7 9.3 14.0 0.0 33.3 66.7 100.0 AGE (Years) Percent Std. dev is a measure of the “average” scatter around the mean. Estimation method: if the distribution is bell shaped, the range is around 6 SD, so here rough guess for SD is 79/6 = 13
  • 60. Std. Deviation age
        • Variation Section of AGE
        • Standard
        • Parameter Variance Deviation
        • Value 333.1884 18.25345
  • 61. 0.0 62.5 125.0 187.5 250.0 0.0 0.5 1.0 1.5 2.0 Std Dev of Shock Index SI Count Estimation method: if the distribution is bell shaped, the range is around 6 SD, so here rough guess for SD is 1.4/6 =.23 Std. dev is a measure of the “average” scatter around the mean.
  • 62. Std. Deviation SI
        • Variation Section of SI
        • Standard Std Error Interquartile
        • Parameter Variance Deviation of Mean Range Range
        • Value 4.155749E-02 0.2038566 6.681129E-03 0.2460432 1.430856
  • 63. Std. Dev of binary variable, PE Std. dev is a measure of the “average” scatter around the mean. 19.44% 80.56%
  • 64. Std. Deviation PE
        • Variation Section of PE
        • Standard
        • Parameter Variance Deviation
        • Value 0.156786 0.3959621
  • 65. Comparing Standard Deviations Mean = 15.5 S = 3.338 11 12 13 14 15 16 17 18 19 20 21 11 12 13 14 15 16 17 18 19 20 21 Data B Data A Mean = 15.5 S = 0.926 11 12 13 14 15 16 17 18 19 20 21 Mean = 15.5 S = 4.570 Data C
    • SSlide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
  • 66.
    • Regardless of how the data are distributed, a certain percentage of values must fall within K standard deviations from the mean:
    Bienaym é- Chebyshev Rule within At least (1 - 1/1 2 ) = 0% …….….. k=1 ( μ ± 1 σ ) (1 - 1/2 2 ) = 75% …........ k=2 ( μ ± 2 σ ) (1 - 1/3 2 ) = 89% ……….... k=3 ( μ ± 3 σ ) Note use of  (sigma) to represent “standard deviation.” Note use of  (mu) to represent “mean”.
  • 67. Symbol Clarification
    • S = Sample standard deviation (example of a “sample statistic”)
    •  = Standard deviation of the entire population (example of a “population parameter”) or from a theoretical probability distribution
    • X = Sample mean
    • µ = Population or theoretical mean
  • 68. **The beauty of the normal curve: No matter what  and  are, the area between  -  and  +  is about 68%; the area between  -2  and  +2  is about 95%; and the area between  -3  and  +3  is about 99.7%. Almost all values fall within 3 standard deviations.
  • 69. 68-95-99.7 Rule 68% of the data 95% of the data 99.7% of the data
  • 70. Summary of Symbols
    • S 2 = Sample variance
    • S = Sample standard dev
    •  2 = Population (true or theoretical) variance
    •  = Population standard dev.
    • X = Sample mean
    • µ = Population mean
    • IQR = interquartile range (middle 50%)
  • 71. Examples of bad graphics
  • 72. What’s wrong with this graph? from : ER Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire, Connecticut, 1983, p.69
  • 73. From: Visual Revelations: Graphical Tales of Fate and Deception from Napoleon Bonaparte to Ross Perot Wainer, H. 1997, p.29. Notice the X-axis
  • 74. Correctly scaled X-axis…
  • 75. Report of the Presidential Commission on the Space Shuttle Challenger Accident , 1986 (vol 1, p. 145) The graph excludes the observations where no O-rings failed.
  • 76.
    • http:// www.math.yorku.ca /SCS/Gallery/
    Smooth curve at least shows the trend toward failure at high and low temperatures…
  • 77. Even better: graph all the data (including non-failures) using a logistic regression model Tappin, L. (1994). "Analyzing data relating to the Challenger disaster". Mathematics Teacher , 87, 423-426
  • 78. What’s wrong with this graph? from : ER Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire, Connecticut, 1983, p.74
  • 79.  
  • 80. What’s the message here? Diagraphics II , 1994
  • 81. Diagraphics II , 1994
  • 82. For more examples…
    • http:// www.math.yorku.ca /SCS/Gallery/
  • 83. “Lying” with statistics
    • More accurately, misleading with statistics…
  • 84. Example 1: projected statistics
    • Lifetime risk of melanoma:
    • 1935: 1/1500
    • 1960: 1/600
    • 1985: 1/150
    • 2000: 1/74
    • 2006: 1/60
    • http://www.melanoma.org/mrf_facts.pdf
  • 85. Example 1: projected statistics
    • How do you think these statistics are calculated?
    • How do we know what the lifetime risk of a person born in 2006 will be?
  • 86. Example 1: projected statistics
    • Interestingly, a clever clinical researcher recently went back and calculated (using SEER data) the actual lifetime risk (or risk up to 70 years) of melanoma for a person born in 1935.
    • The answer?
    • Closer to 1/150 (one order of magnitude off)
    • (Martin Weinstock of Brown University, AAD conference 2006)
  • 87. Example 2: propagation of statistics
    • In many papers and reviews of eating disorders in women athletes, authors cite the statistic that 15 to 62% of female athletes have disordered eating.
    • I’ve found that this statistic is attributed to about 50 different sources in the literature and cited all over the place with or without citations...
  • 88. For example…
    • In a recent review (Hobart and Smucker, The Female Athlete Triad, American Family Physician, 2000):
    • “ Although the exact prevalence of the female athlete triad is unknown, studies have reported disordered eating behavior in 15 to 62 percent of female college athletes.”
    • No citations given.
  • 89. And…
    • Fact Sheet on eating disorders:
    • “ Among female athletes, the prevalence of eating disorders is reported to be between 15% and 62%.” Citation given: Costin, Carolyn. (1999) The Eating Disorder Source Book: A comprehensive guide to the causes, treatment, and prevention of eating disorders. 2 nd edition. Lowell House: Los Angeles.
  • 90. And…
    • From a Fact Sheet on disordered eating from a college website:
    • “ Eating disorders are significantly higher (15 to 62 percent) in the athletic population than the general population.”
    • No citation given.
  • 91. And…
    • “ Studies report between 15% and 62% of college women engage in problematic weight control behaviors (Berry & Howe, 2000).” (in The Sport Journal , 2004)
    • Citation: Berry, T.R. & Howe, B.L. (2000, Sept). Risk factors for disordered eating in female university athletes. Journal of Sport Behavior, 23(3), 207-219.
  • 92. And…
    • 1999 NY Times article
    • “But informal surveys suggest that 15 percent to 62 percent of female athletes are affected by disordered behavior that ranges from a preoccupation with losing weight to anorexia or bulimia.”
  • 93. And
    • “ It has been estimated that the prevalence of disordered eating in female athletes ranges from 15% to 62%.” ( in Journal of General Internal Medicine   15  (8), 577-590.) Citations:
    • Steen SN. The competitive athlete. In: Rickert VI, ed. Adolescent Nutrition: Assessment and Management . New York, NY: Chapman and Hall; 1996:223 47.
    • Tofler IR, Stryer BK, Micheli LJ. Physical and emotional problems of elite female gymnasts. N Engl J Med . 1996; 335 :281 3.
  • 94. Where did the statistics come from? The 15%: Dummer GM, Rosen LW, Heusner WW, Roberts PJ, and Counsilman JE. Pathogenic weight-control behaviors of young competitive swimmers. Physician Sportsmed 1987; 15: 75-84. The “to”: Rosen LW, McKeag DB, O’Hough D, Curley VC. Pathogenic weight-control behaviors in female athletes. Physician Sportsmed . 1986; 14: 79-86. The 62%:Rosen LW, Hough DO. Pathogenic weight-control behaviors of female college gymnasts. Physician Sportsmed 1988; 16:140-146.
  • 95. Where did the statistics come from?
    • Study design? Control group?
      • Cross-sectional survey (all)
      • No non-athlete control groups
    • Population/sample size?
      • Convenience samples
      • Rosen et al. 1986: 182 varsity athletes from two midwestern universities (basketball, field hockey, golf, running, swimming, gymnastics, volleyball, etc.)
      • Dummer et al. 1987: 486 9-18 year old swimmers at a swim camp
      • Rosen et al. 1988: 42 college gymnasts from 5 teams at an athletic conference
  • 96. Where did the statistics come from?
    • Measurement?
      • Instrument: Michigan State University Weight Control Survey
      • Disordered eating = at least one pathogenic weight control behavior:
        • Self-induced vomiting
        • fasting
        • Laxatives
        • Diet pills
        • Diuretics
        • In the 1986 survey, they required use 1/month; in the 1988 survey, they required use twice-weekly
        • In the 1988 survey, they added fluid restriction
  • 97. Where did the statistics come from?
    • Findings?
      • Rosen et al. 1986: 32% used at least one “pathogenic weight-control behavior” (ranges: 8% of 13 basketball players to 73.7% of 19 gymnasts)
      • Dummer et al. 1987: 15.4% of swimmers used at least one of these behaviors
      • Rosen et al. 1988: 62% of gymnasts used at least one of these behaviors
  • 98. References
    • http://www.math.yorku.ca/SCS/Gallery/
    • Kline et al. Annals of Emergency Medicine 2002; 39: 144-152.
    • Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
    • Tappin, L. (1994). "Analyzing data relating to the Challenger disaster". Mathematics Teacher , 87, 423-426
    • Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire, Connecticut, 1983.
    • Visual Revelations: Graphical Tales of Fate and Deception from Napoleon Bonaparte to Ross Perot Wainer, H. 1997.