# Data confusion (how to confuse yourself and others with data analysis)

Published in: Education

Notes on the pitfalls of regression analysis:

- Non-linear relationships: look at the data first.
- Influential points: look for outliers and large residuals, and plot the regression model on the original data set.
- Extrapolating: predicting beyond the range of the actual data.
- Lurking variables: unknown variables that influence both the explanatory and the response variable. They may make a relationship appear strong when the variables are not directly related.
- Summary data: averaging a lot of data makes a relationship appear stronger than it is in the underlying observations.
- Assuming causation: cause and effect can only be determined by a controlled experiment. Here, we are simply identifying that a relationship exists.

1. DATA CONFUSION: How to confuse yourself and others with data analysis
2. AGENDA FOR TODAY’S TALK
   - Good Graphs – Bad Graphs
   - The Law of Averages
   - PTBD Analysis
   - Enumerative & Analytical Problems
   - PARC Analysis
   - Wrong Methods of Analysis
3. “There are three kinds of lies: lies, damned lies and statistics.” (Attributed to Benjamin Disraeli by Mark Twain)
4. GOOD GRAPHS AND BAD GRAPHS
5. DATA RELEVANCE
   - Graphs are only as good as the data they display
   - No amount of creativity can produce good graphs from dubious data
6. DATA CONTENT
   - Don’t produce graphs from very small amounts of data
   - The human brain can grasp 1, 2 or 3 numbers without a graph
7. RULES FOR PRODUCING GOOD GRAPHS
   - KEEP IT SIMPLE AND STUPID
     - Jesse Ventura
   - Tell the truth: don’t distort the data
8. GOOD GRAPHS
   - Portray information without distortion
   - Contain no distracting elements
     - No false third dimensions, irrelevant decoration, or colour (chartjunk)
   - Use an appropriate scale
   - Label axes and tick marks properly, including measurement units
   - Have a descriptive title and/or caption and legend
   - Have a low ink-to-information ratio
10. [Figure: three example charts labelled BAD GRAPH, GOOD GRAPH, GOOD GRAPH]
11. GRAPHS THAT CONFUSE
12. CHART JUNK
13. GRAPHS THAT TELL A STORY
14. HISTOGRAMS
   - No meaningless gaps
   - Reasonable choice of bins
   - Easy to choose or adjust bins
   - Good aspect ratio
   - Meaningful labels on axes
   - Appropriate labels on bin tick marks
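One way to make a "reasonable choice of bins" concrete is a rule of thumb. The sketch below assumes Sturges' rule, which the slide does not name, and uses made-up measurements; the function names are illustrative only.

```python
import math

def sturges_bin_count(n):
    """Sturges' rule of thumb: ceil(log2(n) + 1) bins for n observations."""
    return math.ceil(math.log2(n) + 1)

def histogram_counts(values, bins):
    """Tally values into `bins` equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value would land one past the last bin, so clamp it.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return lo, width, counts

# Hypothetical measurements, for illustration only.
data = [11.8, 11.9, 11.9, 12.0, 12.0, 12.0, 12.1, 12.1, 12.2, 12.4]
bins = sturges_bin_count(len(data))
lo, width, counts = histogram_counts(data, bins)
for i, c in enumerate(counts):
    left = lo + i * width
    print(f"{left:6.2f} to {left + width:6.2f} | {'#' * c}")
```

Because the bin count is a plain argument, it is easy to adjust and re-run if the first choice hides or exaggerates structure in the data.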
15. TRENDING RANDOM VARIATION: “Upward trend”, “Downturn”, “Rebound”, “Setback”, “Turnaround”, “Downward trend”
16. THE LAW OF AVERAGES: “If I sit in a freezer and plunge my head into a pan of boiling chip fat... on average, I’m quite comfortable.”
17. SHEWHART’S RULES FOR PRESENTATION OF DATA
   - Rule One
     - Data should always be presented in a way that preserves the evidence in the data
   - Rule Two
     - When an average, standard deviation or histogram is used to summarize data, the user should not be misled into taking action they would not take if the data were presented in a time series
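18. USING THE WRONG METHODS

Descriptive Statistics: A, B, C, D

| Variable | N | Mean | StDev | CoefVar | Minimum | Maximum |
|---|---|---|---|---|---|---|
| A | 20 | 11.950 | 0.102 | 0.85 | 11.83 | 12.08 |
| B | 20 | 11.950 | 0.100 | 0.84 | 11.85 | 12.25 |
| C | 20 | 11.950 | 0.102 | 0.86 | 11.75 | 12.15 |
| D | 20 | 11.950 | 0.100 | 0.84 | 11.81 | 12.14 |

Process data in time order:

| Obs | A | B | C | D |
|---|---|---|---|---|
| 1 | 11.85 | 11.85 | 11.75 | 12.14 |
| 2 | 11.83 | 11.86 | 11.95 | 12.01 |
| 3 | 11.87 | 11.87 | 11.80 | 11.88 |
| 4 | 11.84 | 11.87 | 11.94 | 12.07 |
| 5 | 11.85 | 11.88 | 11.95 | 11.95 |
| 6 | 11.86 | 11.89 | 12.00 | 11.87 |
| 7 | 11.85 | 11.89 | 12.05 | 12.06 |
| 8 | 11.85 | 11.90 | 11.85 | 11.94 |
| 9 | 11.84 | 11.92 | 11.94 | 11.84 |
| 10 | 11.86 | 11.91 | 11.85 | 12.05 |
| 11 | 12.05 | 11.93 | 12.05 | 11.93 |
| 12 | 12.06 | 11.93 | 11.85 | 11.83 |
| 13 | 12.03 | 11.95 | 12.05 | 12.04 |
| 14 | 12.02 | 11.97 | 11.95 | 11.92 |
| 15 | 12.03 | 11.96 | 11.95 | 11.82 |
| 16 | 12.04 | 11.99 | 11.95 | 12.03 |
| 17 | 12.06 | 12.00 | 11.85 | 11.91 |
| 18 | 12.06 | 12.00 | 12.10 | 11.81 |
| 19 | 12.04 | 12.16 | 12.00 | 12.01 |
| 20 | 12.08 | 12.25 | 12.15 | 11.81 |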
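The point of this slide can be checked directly from the listed values. The sketch below recomputes the mean and standard deviation for each process and then compares the first ten observations with the last ten. The half-by-half comparison is only a crude stand-in (an assumption of this example, not something the slides use) for plotting the data in time order.

```python
from statistics import mean, stdev

# Values transcribed from the slide, in time order.
processes = {
    "A": [11.85, 11.83, 11.87, 11.84, 11.85, 11.86, 11.85, 11.85, 11.84, 11.86,
          12.05, 12.06, 12.03, 12.02, 12.03, 12.04, 12.06, 12.06, 12.04, 12.08],
    "B": [11.85, 11.86, 11.87, 11.87, 11.88, 11.89, 11.89, 11.90, 11.92, 11.91,
          11.93, 11.93, 11.95, 11.97, 11.96, 11.99, 12.00, 12.00, 12.16, 12.25],
    "C": [11.75, 11.95, 11.80, 11.94, 11.95, 12.00, 12.05, 11.85, 11.94, 11.85,
          12.05, 11.85, 12.05, 11.95, 11.95, 11.95, 11.85, 12.10, 12.00, 12.15],
    "D": [12.14, 12.01, 11.88, 12.07, 11.95, 11.87, 12.06, 11.94, 11.84, 12.05,
          11.93, 11.83, 12.04, 11.92, 11.82, 12.03, 11.91, 11.81, 12.01, 11.81],
}

half_shift = {}
for name, values in processes.items():
    first, second = values[:10], values[10:]
    # Difference between the late and early halves of the time series.
    half_shift[name] = mean(second) - mean(first)
    print(f"{name}: mean={mean(values):.3f}  sd={stdev(values):.3f}  "
          f"first half={mean(first):.3f}  second half={mean(second):.3f}")
```

The overall means and standard deviations come out nearly identical, yet process A jumps by roughly 0.2 between observations 10 and 11, which the summary table alone cannot show.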
19. NO SIGNIFICANT DIFFERENCE HERE!
20. NO DIFFERENCE?!?
21. ALWAYS CARRY OUT PTBD ANALYSIS: PLOT THE B….. DOTS!
22. TYPES OF STATISTICAL STUDIES
   - Descriptive
   - Enumerative
   - Analytic
23. DESCRIPTIVE STUDY
   - Count all fish in barrel
   - Count number of goldfish
   - Proportion of goldfish applies to the fish population in this barrel and no other barrels of fish
24. ENUMERATIVE STUDY
   - Take a sample of fish from the barrel, and count the number of goldfish in the sample
   - Point estimate of the proportion of goldfish in the barrel population
   - Many statistical procedures do this
   - Cannot make any inference about any other barrels of fish
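As a minimal sketch of the enumerative case, the code below computes a point estimate and an approximate interval for the proportion of goldfish in one barrel. The sample counts are hypothetical, and the normal-approximation (Wald) interval is an assumption of this example; the slides do not specify a method.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(successes, n, confidence=0.95):
    """Point estimate and normal-approximation interval for a proportion."""
    p = successes / n
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    half_width = z * sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical sample: 12 goldfish among 50 fish drawn from one barrel.
p, lo, hi = proportion_interval(12, 50)
print(f"estimate={p:.3f}, 95% interval=({lo:.3f}, {hi:.3f})")
```

The interval describes this barrel only; it says nothing about other barrels or about what the packing process will produce in future.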
25. ANALYTICAL STUDY
   - Will we get the same proportion of goldfish in the future as we got in the past?
   - An analytical study allows prediction within limits
   - [Chart: Fish Packing Process over Time]
26. ANALYTICAL STUDY
   - Proportion of goldfish is stable over time
   - Fish packing process is predictable within limits
   - We can expect, on average, 4 goldfish per barrel, but as many as 10 and as few as 0 in any single barrel
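The slide does not say how the limits of 0 and 10 were obtained. One reading that reproduces them is a control chart for counts (a c-chart) with 3-sigma limits, where the standard deviation of a count is taken as the square root of its mean. The sketch below works under that assumption.

```python
from math import sqrt

def c_chart_limits(mean_count):
    """3-sigma limits for a count, assuming sigma = sqrt(mean); floor at zero."""
    half_width = 3 * sqrt(mean_count)
    return max(0.0, mean_count - half_width), mean_count + half_width

lower, upper = c_chart_limits(4)
print(f"expect between {lower:.0f} and {upper:.0f} goldfish per barrel")
```

With an average of 4, the limits are 4 ± 3 × 2, so the lower limit is truncated at 0 and the upper limit is 10, matching the figures on the slide.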
27. ENUMERATIVE vs ANALYTICAL METHODS
   - Enumerative methods
     - seek to provide numeric summaries, confidence intervals, etc.
     - use significance tests, ANOVA, descriptive stats, etc., and assume a single, stable population
   - Analytical methods
     - seek to understand the system under study
     - use primarily graphical tools such as run charts, control charts, histograms, box plots, etc.
     - in the real world, most problems are analytical
28. “Analysis of variance, t-tests, confidence intervals, and other statistical techniques taught in books, …, are inappropriate because they provide no basis for prediction and because they bury the information contained in the order of production.” (W.E. Deming, Out of the Crisis)
   - Traditional statistical methods have their place, but are widely abused in the real world. When this is the case, statistics do more to cloud the issue than to enlighten.
29. PARC ANALYSIS
   - Practical Accumulated Records Compilation
   - Passive Analysis (by) Regression Correlations
   - Planning After Research Completed
   - Profound Analysis Relying (on) Computers
   - Note the inverse relationship with:
     - Continuous Recording (of) Administrative Procedures
     - Constant Repetition (of) Anecdotal Perceptions
30. PLANNING A PROCESS IMPROVEMENT STUDY
   - Why collect the data?
   - What statistical methods for analysis?
   - What data will be collected?
   - How much data do we need?
   - How will the data be measured?
   - How good is the measurement system?
   - When and where will data be collected?
   - Who will collect the data?
   - Remember:
31. GARBAGE IN – GARBAGE OUT
32. WHAT’S SIGNIFICANT?

Two-sample T for C1 vs C2

| Sample | N | Mean | StDev | SE Mean |
|---|---|---|---|---|
| A | 5 | 13.652 | 0.487 | 0.22 |
| B | 5 | 14.369 | 0.646 | 0.29 |

- Difference = mu (C1) - mu (C2)
- Estimate for difference: -0.716615
- 95% CI for difference: (-1.551531, 0.118301)
- T-Test of difference = 0 (vs not =): T-Value = -1.98, P-Value = 0.083, DF = 8
- Both use Pooled StDev = 0.5725

Two-sample T for C3 vs C4

| Sample | N | Mean | StDev | SE Mean |
|---|---|---|---|---|
| A | 200 | 13.510 | 0.501 | 0.035 |
| B | 200 | 13.667 | 0.492 | 0.035 |

- Difference = mu (C3) - mu (C4)
- Estimate for difference: -0.157292
- 95% CI for difference: (-0.254935, -0.059649)
- T-Test of difference = 0 (vs not =): T-Value = -3.17, P-Value = 0.002, DF = 398
- Both use Pooled StDev = 0.4967

Mean A = 13.7, Mean B = 14.4: not significant? Mean A = 13.5, Mean B = 13.7: significant?
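The apparent paradox comes from the sample sizes. As a rough check, the sketch below recomputes the pooled two-sample t statistic from the summary figures in the two outputs. Because the inputs are the rounded means and standard deviations, the results match the reported values only approximately. The function name is illustrative.

```python
from math import sqrt

def pooled_t(n1, mean1, sd1, n2, mean2, sd2):
    """Pooled-variance two-sample t statistic from summary statistics."""
    df = n1 + n2 - 2
    pooled_sd = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)
    std_error = pooled_sd * sqrt(1 / n1 + 1 / n2)
    return (mean1 - mean2) / std_error, df, pooled_sd

small = pooled_t(5, 13.652, 0.487, 5, 14.369, 0.646)
large = pooled_t(200, 13.510, 0.501, 200, 13.667, 0.492)

for label, (t, df, sp) in (("n=5", small), ("n=200", large)):
    print(f"{label}: t={t:.2f}, df={df}, pooled sd={sp:.4f}")
```

The standard error shrinks with the square root of the sample size, so a difference of about 0.16 with 200 observations per group yields a larger t statistic than a difference of about 0.72 with 5 per group. Statistical significance says nothing about whether either difference matters in practice.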
33. WHAT SHOULD I DO WITH OUTLIERS?
   - Data point far away from the rest of the data
   - Don’t remove outliers to make data “look good”
   - Do you know why it is different?
   - If you do, remove it. If you don’t, leave it in
   - Could have a big impact on the analysis
   - Re-run analysis without outlier, and compare results
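The last bullet can be sketched as follows. The data are hypothetical, and treating the largest value as the suspect point is an assumption made purely for illustration.

```python
from statistics import mean, stdev

def summarize(values):
    """Basic summary used to compare the two runs of the analysis."""
    return {"n": len(values), "mean": mean(values), "sd": stdev(values)}

# Hypothetical measurements; 14.5 is the suspect point.
data = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 14.5]
suspect = max(data)

with_outlier = summarize(data)
without_outlier = summarize([v for v in data if v != suspect])

for label, s in (("with", with_outlier), ("without", without_outlier)):
    print(f"{label:8s} n={s['n']}  mean={s['mean']:.2f}  sd={s['sd']:.2f}")
```

Reporting both sets of results side by side lets the reader see how much one point drives the conclusion, whichever version is finally used.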
34. “REGRESSION” WITH EXCEL
   - Usually means drawing an X-Y plot, fitting a straight line and coming up with an R² value.
   - As long as R² is high, everything’s hunky-dory.
   - WRONG!
35. “REGRESSION” WITH EXCEL: Relationship is clearly not linear, and should not be presented as such
36. “REGRESSION” WITH EXCEL
   - Regression model checking – in Excel?
   - Residual plots:
     - Normally distributed
     - Random pattern when plotted vs fitted values
   - [Residual plot panels: OK; Variance not homogeneous; Model incorrect]
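A small sketch of why a high R² is not enough: the code below fits a straight line by ordinary least squares to data that are exactly quadratic. The data set is made up for illustration and is not the one shown on the slides.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line; returns slope, intercept, R^2, residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot, residuals

# Hypothetical, clearly non-linear data: y = x^2 for x = 1..10.
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]

slope, intercept, r2, residuals = fit_line(xs, ys)
print(f"R^2 = {r2:.3f}")
print("residuals:", [round(r, 1) for r in residuals])
```

R² comes out at about 0.95, yet the residuals are positive at both ends and negative in the middle. That systematic curve is the "model incorrect" pattern, and it only becomes visible when the residuals are examined against the fitted values.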
37. PITFALLS OF REGRESSION ANALYSIS
   - Non-Linear Relationships
   - Influential Points
   - Extrapolating
   - Lurking Variables
   - Summary Data
   - Assuming Causation
38. THAT’S (WITH REASONABLE PROBABILITY) THE END FOLKS!
   - And remember,
   - With statistics, you never have to say you’re certain!
39. THANK YOU FOR YOUR ATTENTION
   - ARE THERE ANY QUESTIONS?
   - GOOD LUCK!!