Understanding Data

Published in: Education, Technology
Understanding Data

  1. Not Waving But Drowning: Understanding Data. Andrew Hingston, Switch Solutions (quant training solutions), ahingston@switchsolutions.com.au
  2. Find out from someone: their name and what they do; a hobby or interest that others don't know about; whether they are waving or drowning in data.
     "Not Waving But Drowning" by Stevie Smith:
       Nobody heard him, the dead man,
       But still he lay moaning:
       I was much further out than you thought
       And not waving but drowning.
       Poor chap, he always loved larking
       And now he's dead
       It must have been too cold for him his heart gave way,
       They said.
       Oh, no no no, it was too cold always
       (Still the dead one lay moaning)
       I was much too far out all my life
       And not waving but drowning.
  3. Course schedule: Descriptive statistics; Normal distribution; Monitoring processes 1; Monitoring processes 2; Hypothesis testing; Simple regression 1; Simple regression 2; Multiple regression 1; Multiple regression 2; Time series models 1; Time series models 2; Review
  4. Learning objectives: understand and calculate descriptive statistics; plot and interpret a histogram; plot and interpret a box plot; interpret descriptive statistics to solve basic business problems
  5. Why understand data? Fact-based decisions; avoiding bias; power and persuasion
  6. Common biases: memorability; anchoring and adjustment; status quo; self-serving; negative comparisons; framing
  7. Sources of power: legitimate power (position); referent power (loyalty); expert power (skills and expertise); reward power (material rewards); coercive power (withhold rewards). Source: French and Raven (1959), "The bases of social power". See Wikipedia: "Power (philosophy)".
  8. Principles of persuasion: reciprocity (favours); consistency (commitment); social proof (herd); authority; liking; scarcity. Source: Robert Cialdini (2001), "Influence: Science and practice". See Wikipedia: "Robert Cialdini".
  9. Some data jargon
  10. Steps for using data: specify the problem; propose answers; identify the right tools; obtain your data; visualise it; crunch the numbers; interpret, persuade, apply
  11. Discussion question: describe one occasion when charts or statistics were used well, and one occasion when they were abused.
  12. Section 1: Visualising data
  13. Why visualise your data? For you: fast understanding; builds a solid 'foundation'; flags problems. For others: easier to follow; memorable; less information overload; more convincing.
  14. When to use each of these charts? (slide footer: 1. Visualising data with charts)
  15. Bar charts
  16. Column charts
  17. Line charts (source: StatCounter Global Stats)
  18. Scatterplot
  19. Discussion: rate these mobile OS GUIs out of 10, based on overall attractiveness and functionality for the consumer market aged 16 to 30.
  20. Bubble chart
  21. Pie chart
  22. Stacked column
  23. Radar charts
  24. Compound charts (e.g. stock chart)
  25. Chart presentation tips: must tell the story in under 10 seconds; avoid complexity; think about colours; think about font sizes; use handouts creatively; avoid jargon
  26. Exercise: charting exercise using software and data.
  27. Section 2: Measuring the middle. Mean; median; mode; weighted average; trimmed mean
  28. Mean: the arithmetic average of a set of data points. Sum the values, then divide by the number of data points. Example: mean return of the ASX200 = 12% p.a. Advantages: easy to calculate; easy to interpret for symmetric distributions; based on all the data. Disadvantages: less useful with skewed data; affected by outliers. (slide footer: 3. Numerical statistics)
  29. Median: the middle data point when the data are ordered. Example: Sydney median house price = $600k. Advantages: easy to obtain from a sorted list; easy to interpret; unaffected by outliers. Disadvantages: based only on the 'middle' data point(s), so it can be more variable than the sample mean.
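The trade-off between the mean and the median is easiest to see with one outlier in the data. The sketch below uses Python's standard `statistics` module; the house prices are hypothetical values chosen for illustration, not data from the course.

```python
from statistics import mean, median

# Hypothetical house prices in $k; the last value is a deliberate outlier.
prices = [450, 520, 580, 600, 640, 700, 3500]

mean_price = mean(prices)      # uses every value, so the outlier drags it up
median_price = median(prices)  # middle value of the sorted list

# Trimmed mean: drop the smallest and largest value, then average the rest.
trimmed = sorted(prices)[1:-1]
trimmed_mean_price = mean(trimmed)

print(f"mean         = {mean_price:.1f}")
print(f"median       = {median_price}")
print(f"trimmed mean = {trimmed_mean_price}")
```

The mean lands far above every typical price, while the median and the trimmed mean stay close to the bulk of the data.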
  30. Section 3: Measuring the spread. Max, min, range; inter-quartile range; standard deviation; coefficient of variation
  31. Standard deviation and variance: measure the 'variability' or 'spread' of data, based on how far each score varies from the mean. Common notation: S² or σ² = variance; S or σ = standard deviation.
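As a rough illustration of "how far each score varies from the mean", the sketch below computes the sample variance by hand and checks it against the standard library. The returns are hypothetical, and the sample (n − 1) form is an assumption, since the slide does not say which form the course uses.

```python
from statistics import mean, stdev, variance

returns = [12.0, -5.0, 20.0, 8.0, 15.0]  # hypothetical annual returns in %

m = mean(returns)
s2 = variance(returns)  # sample variance: squared deviations divided by n - 1
s = stdev(returns)      # sample standard deviation: square root of the variance
cv = s / m              # coefficient of variation: spread relative to the mean

# The same variance written out step by step.
s2_by_hand = sum((x - m) ** 2 for x in returns) / (len(returns) - 1)

print(f"mean = {m}, variance = {s2}, std dev = {s:.2f}, CV = {cv:.2f}")
```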
  32. Spread and normal distribution. [Figure: bell curve marked at the mean and at ±1 SD, ±2 SD and ±3 SD.] ASX200 example: mean = 10%, SD (std dev) = 10%. 68.2% chance of falling within ±1 SD; 95.4% chance within ±2 SD; 99.7% chance within ±3 SD ... more on this in the next unit!
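The three percentages can be reproduced from a normal model with the slide's mean and SD. This is a minimal sketch using `statistics.NormalDist` from the Python standard library; treating ASX200 returns as exactly normal is the slide's simplification, not an established fact.

```python
from statistics import NormalDist

asx = NormalDist(mu=10, sigma=10)  # mean and SD from the slide, in % per year

within = {}
for k in (1, 2, 3):
    low = asx.mean - k * asx.stdev
    high = asx.mean + k * asx.stdev
    # Probability of landing between low and high = difference of two CDFs.
    within[k] = asx.cdf(high) - asx.cdf(low)
    print(f"within {k} SD ({low:.0f}% to {high:.0f}%): {within[k]:.2%}")
```

The one-SD figure comes out at about 68.27%, which the slide shows truncated to 68.2%.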
  33. Inter-quartile range (IQR): the spread of the middle 50% of the data. IQR = Q3 - Q1, where, with the data ordered from small to large, Q1 is the data point at the end of the first quarter and Q3 is the data point at the end of the third quarter.
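A small sketch of the IQR calculation, using `statistics.quantiles` from the Python standard library. The data set is hypothetical. Software packages use slightly different quartile conventions, so the default method here is an assumption and other tools may give slightly different Q1 and Q3 values.

```python
from statistics import quantiles

data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 20]  # hypothetical observations

# n=4 splits the ordered data into quarters and returns the three cut points.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
```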
  34. Exercise: calculate descriptive statistics for a data set, and express what each one means in your own words.
  35. Section 4: Visualising with histograms
  36. Histogram advantages: easy to construct; easy to interpret; indicates symmetry or skewness; indicates multimodality (more than one peak); can be used if the data come in grouped form. (slide footer: 2. Graphical representation)
  37. Histogram disadvantages: the original data points are not shown; the width of the class intervals affects the appearance; class intervals are often not well chosen; small samples can be misleading.
  38. Histogram stories (shapes): double-peaked; bell-shaped; comb; plateau; skewed; truncated; edge-peaked; isolated-peaked
  39. Histogram interval width. Determined by: the number of data points (as number ↑, width ↓); the spread of the data (as spread ↑, width ↑). Best calculated using: IQR = inter-quartile range (spread of the middle 50% of the data); n = number of data points.
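The formula itself did not survive in this transcript. One widely used rule built from exactly these two inputs is the Freedman-Diaconis rule, width = 2 × IQR / n^(1/3). The sketch below assumes that rule purely for illustration; it may not be the formula shown on the original slide, and the function name is hypothetical.

```python
def fd_bin_width(iqr: float, n: int) -> float:
    """Freedman-Diaconis bin width: 2 * IQR / cube root of n."""
    if iqr <= 0 or n <= 0:
        raise ValueError("iqr and n must both be positive")
    return 2 * iqr / n ** (1 / 3)


# More data points give narrower intervals; more spread gives wider ones.
print(f"IQR = 8,  n = 100:  width = {fd_bin_width(8, 100):.2f}")
print(f"IQR = 8,  n = 1000: width = {fd_bin_width(8, 1000):.2f}")
print(f"IQR = 16, n = 100:  width = {fd_bin_width(16, 100):.2f}")
```

Both directions described on the slide show up: the width shrinks as n grows and widens as the IQR grows.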
  40. Section 5: Visualising with box plots
  41. Box plots: a rich graphical representation that shows the location of the data (mean and median), spread (inter-quartile range plus visual), symmetry, and extreme data points (outliers). Very useful but underutilised, since most managers have weak data skills and Excel can't do them ... you need a stats package! (slide footer: 4. Box plots)
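The numbers behind a box plot can be computed without drawing anything. The sketch below assumes the common textbook convention that points more than 1.5 × IQR beyond the quartiles are mild outliers and points more than 3 × IQR beyond are extreme outliers; the slides do not state which convention the course uses, and the receipts are hypothetical.

```python
from statistics import mean, quantiles

receipts = [12, 15, 18, 20, 22, 25, 27, 30, 33, 60, 95]  # hypothetical, in $

q1, med, q3 = quantiles(receipts, n=4)
iqr = q3 - q1

# Assumed convention: inner fences at 1.5 * IQR, outer fences at 3 * IQR.
inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr

extreme = [x for x in receipts if x < outer_low or x > outer_high]
mild = [x for x in receipts
        if (x < inner_low or x > inner_high) and x not in extreme]

print(f"Q1 = {q1}, median = {med}, Q3 = {q3}, IQR = {iqr}")
print(f"mean = {mean(receipts):.1f} (above the median: a hint of right skew)")
print(f"mild outliers: {mild}, extreme outliers: {extreme}")
```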
  42. Box plot example
  43. Interpreting shape. [Figure: three box plots labelled right-skewed, left-skewed and symmetric, each marked with Q1, median and Q3; an asterisk (*) marks the mean.]
  44. Box plot comparisons: box plots are useful for comparing samples. If the boxes do not overlap: strong evidence that the two samples are different (preferably with more than 10 data points!). If the boxes do overlap: the samples may or may not be different; need to use more advanced techniques (later).
  45. End of unit exercise 1.1: Bishop's supermarket. Describe the shape of the register receipts. Too many bins in the histogram? Mild or extreme outliers? What should we do with them? How much do customers typically spend?
  46. End of unit exercise 1.2: Chan's laundry. Does discount size affect profitability? If so, by how much? Recommendations?
  47. Class discussion question: if the long-run average return and standard deviation of the ASX200 have both been 10% per year, what is the likelihood of the -40% returns experienced in 2008? (slide footer: Data analysis)
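One way to start this discussion is to see what a strictly normal model would say. The sketch below assumes annual returns are normally distributed with the stated mean and standard deviation, which is precisely the assumption the question invites you to challenge.

```python
from statistics import NormalDist

mean_return = 10.0  # % per year, from the question
sd_return = 10.0    # % per year, from the question
crash = -40.0       # the 2008 return quoted in the question

z = (crash - mean_return) / sd_return  # standard deviations from the mean
p_crash = NormalDist().cdf(z)          # P(return <= -40%) under normality

print(f"z-score = {z}")
print(f"P(return <= {crash}%) = {p_crash:.2e}")
print(f"roughly one year in {1 / p_crash:,.0f}")
```

A probability this small for something that actually happened suggests that real returns have fatter tails than the normal distribution allows.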
  48. Normal distribution. [Figure: bell curve with the 68.2%, 95.4% and 99.7% chance regions marked.]
  49. Features: bell-shaped with a single peak (unimodal); mean = median = mode = μ; symmetrical around the mean (μ); skewness = 0 (if negative, tail on the left and bulge on the right); kurtosis = 3 (if > 3, pinched with fat tails); values range from -∞ to +∞; total area under the curve = 1 (probability of all events); a sum or other linear combination of two or more independent normal variables is also normal.
  50. Different means: the means of two variables can be different, but they can both still be normal. [Figure: two bell curves with different means, each with σ = 1.]
  51. Different standard deviations: the standard deviations of two variables can be different, but they can both still be normal. [Figure: two bell curves, one with σ = 1 and one with σ = 2.]
  52. Using Excel for probabilities: value of X = 3; mean (μ) = 2; std dev (σ) = 1. Calculates the probability between -∞ and X. In Excel: =NORMDIST(X, μ, σ, 1). In this case: =NORMDIST(3, 2, 1, 1) = 0.84.
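For readers without Excel, the cumulative form of NORMDIST corresponds to the `cdf` method of `statistics.NormalDist` in the Python standard library. A minimal sketch with the slide's numbers:

```python
from statistics import NormalDist

x, mu, sigma = 3, 2, 1  # values from the slide

# Counterpart of =NORMDIST(3, 2, 1, 1): probability between -infinity and X.
p = NormalDist(mu=mu, sigma=sigma).cdf(x)

print(f"P(X < {x}) = {p:.4f}")
```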
  53. Using Excel for value of X: probability p = 0.84; mean (μ) = 2; std dev (σ) = 1. Calculates X for a given probability between -∞ and X. In Excel: =NORMINV(p, μ, σ). In this case: =NORMINV(0.84, 2, 1) ≈ 3.
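Going from a probability back to a value of X is the inverse operation. In Python the counterpart of NORMINV is `NormalDist.inv_cdf`; the sketch below uses the slide's numbers and also checks the round trip.

```python
from statistics import NormalDist

dist = NormalDist(mu=2, sigma=1)  # mean and std dev from the slide

# Counterpart of =NORMINV(0.84, 2, 1): the X with 84% of the area to its left.
x = dist.inv_cdf(0.84)

print(f"X = {x:.4f}")                   # close to 3, as on the slide
print(f"check: P(X < x) = {dist.cdf(x):.2f}")
```

The result is slightly under 3 because 0.84 is itself a rounded probability.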
  54. Standard normal distribution: the special normal distribution with μ = 0 and σ = 1, used to generalise for all normal distributions. Z-score = number of standard deviations from the mean.
  55. Using Excel for the standard normal distribution: value of Z = 1; mean (μ) = 0; std dev (σ) = 1. Calculates the probability between -∞ and Z. In Excel: =NORMSDIST(Z). In this case: =NORMSDIST(1) = 0.84. And also: =NORMSINV(0.84) ≈ 1.
  56. Negative Z-scores: happen when X < μ (the mean); they count the number of standard deviations to the left of the mean. In Excel: =NORMSDIST(-1) = 0.16.
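Converting X to a Z-score uses Z = (X − μ) / σ. The sketch below shows that the standard normal lookup gives the same probability as working with X directly, and that a value below the mean produces a negative Z. The helper name `z_score` is illustrative.

```python
from statistics import NormalDist


def z_score(x: float, mu: float, sigma: float) -> float:
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma


standard = NormalDist()  # mean 0, std dev 1

z = z_score(3, mu=2, sigma=1)                 # one SD above the mean
p_via_z = standard.cdf(z)                     # like =NORMSDIST(1)
p_direct = NormalDist(mu=2, sigma=1).cdf(3)   # like =NORMDIST(3, 2, 1, 1)

z_neg = z_score(1, mu=2, sigma=1)             # X below the mean
p_neg = standard.cdf(z_neg)                   # like =NORMSDIST(-1)

print(f"z = {z}, P = {p_via_z:.2f} (direct: {p_direct:.2f})")
print(f"z = {z_neg}, P = {p_neg:.2f}")
```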
  57. Fun with Z-scores (μ = 0, σ = 1). Given P(-∞ < Z < 1) = 0.84, you can use this result for lots of other regions! Find: P(-∞ < Z < +∞); P(Z < 0); P(Z > 0); P(Z < 1); P(Z > -1); P(Z > 1); P(Z < -1); P(0 < Z < 1); P(-1 < Z < 1).
  58. Fun with Z-scores, answers (μ = 0, σ = 1), all derived from P(-∞ < Z < 1) = 0.84:
        P(-∞ < Z < +∞) = 1.00
        P(Z < 0)       = 0.50
        P(Z > 0)       = 0.50
        P(Z < 1)       = 0.84
        P(Z > -1)      = 0.84
        P(Z > 1)       = 0.16
        P(Z < -1)      = 0.16
        P(0 < Z < 1)   = 0.34
        P(-1 < Z < 1)  = 0.68
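Every region in this exercise follows from the single value P(Z < 1) plus two facts: the total area is 1 and the curve is symmetric about 0. A short sketch that derives each answer that way and cross-checks one of them against a direct lookup:

```python
from statistics import NormalDist

Z = NormalDist()      # standard normal: mean 0, std dev 1
p_below_1 = Z.cdf(1)  # the one "given" value, about 0.84

regions = {
    "P(Z < 0)": 0.5,                     # half the area lies left of the mean
    "P(Z > 0)": 0.5,
    "P(Z < 1)": p_below_1,
    "P(Z > -1)": p_below_1,              # mirror image of P(Z < 1)
    "P(Z > 1)": 1 - p_below_1,           # complement
    "P(Z < -1)": 1 - p_below_1,          # mirror image of P(Z > 1)
    "P(0 < Z < 1)": p_below_1 - 0.5,
    "P(-1 < Z < 1)": 2 * p_below_1 - 1,
}

for label, p in regions.items():
    print(f"{label:<14} = {p:.2f}")
```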
  59. Fun with probability (μ = 2, σ = 1). Given P(-∞ < X < 3) = 0.84, use the same logic as before or convert X to a Z-score (Z = 1):
        P(-∞ < X < +∞) = 1.00
        P(X < 2)       = 0.50
        P(X > 2)       = 0.50
        P(X < 3)       = 0.84
        P(X > 1)       = 0.84
        P(X > 3)       = 0.16
        P(X < 1)       = 0.16
        P(2 < X < 3)   = 0.34
        P(1 < X < 3)   = 0.68
  60. Chebyshev's rule: a 'rough as guts' estimate of probability, for use when the variable is not normally distributed. The rule: at least 3/4 of values fall within 2 standard deviations of the mean; at least 8/9 fall within 3 standard deviations of the mean. Beware! Even if you don't know the distribution of the variable, you might know the distribution of the sample mean or the average. See the Central Limit Theorem later!
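Both fractions come from the general form of Chebyshev's rule: at least 1 − 1/k² of values lie within k standard deviations of the mean, for any k > 1. The sketch below computes that bound exactly and sets it beside the normal-distribution figure to show how conservative it is. The function name is illustrative.

```python
from fractions import Fraction
from statistics import NormalDist


def chebyshev_lower_bound(k: int) -> Fraction:
    """Minimum share of values within k std devs of the mean (any distribution)."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - Fraction(1, k * k)


Z = NormalDist()
for k in (2, 3):
    bound = chebyshev_lower_bound(k)
    normal = Z.cdf(k) - Z.cdf(-k)  # exact share if the variable were normal
    print(f"k = {k}: at least {bound} ({float(bound):.1%}); normal gives {normal:.1%}")
```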
  61. Problem solving tips: write down the variables from the question (e.g. μ = 5, σ = 2, P(X > 6) = ???); draw a quick bell-shaped diagram; mark μ in the middle of the bell and the position of X; shade the region that you are trying to find; look for a provided NORMDIST or NORMSDIST value; if given NORMSDIST, calculate the Z-score; if given NORMDIST, you usually don't have to; identify the probability of the correct region(s). (slide footer: Exam tips)
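The example in these tips (μ = 5, σ = 2, P(X > 6)) can be worked through in the same order as the steps. A sketch using the Python standard library in place of the Excel lookups:

```python
from statistics import NormalDist

# Step 1: write down the variables from the question.
mu, sigma, x = 5, 2, 6

# Step 2: convert X to a Z-score (needed when only NORMSDIST values are given).
z = (x - mu) / sigma

# Step 3: NORMSDIST(z) gives the area to the LEFT of z ...
p_left = NormalDist().cdf(z)

# Step 4: ... but the shaded region is to the RIGHT of X, so take the complement.
p_right = 1 - p_left

print(f"z = {z}, P(X < {x}) = {p_left:.4f}, P(X > {x}) = {p_right:.4f}")
```

Sketching the bell and shading the region first makes it harder to forget the complement in the last step.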
  62. End of unit exercises. Rework: NORMSDIST(2.31) = 0.9896; NORMSINV(0.99) = 2.326; NORMSINV(0.15) = -1.036. Quality: NORMSDIST(-2) = 0.0228; NORMSDIST(2.5) = 0.9938. Coal yield: NORMSINV(0.2) = -0.842; NORMSINV(0.01) = -2.326. Answers: 1a) 0.0104; 1b) 0.05976; 1c) 0.00942; 2. $9.66; 3a) 0.1587; 3b) 792.1; 3c) 42.99.
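The lookup values quoted for these exercises can be checked against the standard normal distribution in Python. The sketch below treats NORMSDIST as `NormalDist().cdf` and NORMSINV as `NormalDist().inv_cdf`, and allows for rounding in the quoted figures. It checks only the lookup values, since the exercise questions themselves are not part of this transcript.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

# (Excel function, argument, value quoted on the slide)
quoted = [
    ("NORMSDIST", 2.31, 0.9896),
    ("NORMSINV", 0.99, 2.326),
    ("NORMSINV", 0.15, -1.036),
    ("NORMSDIST", -2, 0.0228),
    ("NORMSDIST", 2.5, 0.9938),
    ("NORMSINV", 0.2, -0.842),
    ("NORMSINV", 0.01, -2.326),
]


def excel_equivalent(name: str, arg: float) -> float:
    """Python counterpart of the two Excel functions used on the slide."""
    return Z.cdf(arg) if name == "NORMSDIST" else Z.inv_cdf(arg)


mismatches = []
for name, arg, value in quoted:
    computed = excel_equivalent(name, arg)
    if abs(computed - value) > 0.001:  # tolerance for the slide's rounding
        mismatches.append((name, arg, value, computed))
    print(f"{name}({arg}) = {computed:.4f} (slide: {value})")

print("all quoted values agree" if not mismatches else mismatches)
```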
  63. Distribution of sample means: take lots of random samples from a population. What are the features of the mean of each sample? The mean of each sample will be a bit different; they should be quite close to the population mean; for big samples the mean should be close; for small samples the mean could be very different. [Figure: a population with Samples 1 to 4 drawn from it.] (slide footer: 3. Central Limit Theorem)
  64. Populations have lots of distributions! Double-peaked; bell-shaped; comb; plateau; skewed; truncated; edge-peaked; isolated-peaked
  65. ... but means of samples are normal! Central Limit Theorem: it doesn't matter how the population is distributed; if you take a sufficiently large sample (n > 30), the probability distribution of the sample means will be approximately normal, centred on the population mean. What is 'sufficiently large'? The bigger the sample, the more 'normal' the distribution of the mean; for this course, a sample size of n > 30.
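The theorem can be checked with a quick simulation. The sketch below draws repeated samples from a strongly right-skewed population (an exponential distribution with mean 1 and standard deviation 1, chosen here only as an example) and looks at how the sample means behave. The sample size, number of samples and seed are arbitrary choices.

```python
import random
from statistics import mean, stdev

random.seed(42)     # fixed seed so the run is repeatable
SAMPLE_SIZE = 40    # n > 30, the course's rule of thumb
NUM_SAMPLES = 2000  # how many samples to draw


def draw_sample_mean() -> float:
    """Mean of one random sample from a skewed (exponential) population."""
    sample = [random.expovariate(1.0) for _ in range(SAMPLE_SIZE)]
    return mean(sample)


sample_means = [draw_sample_mean() for _ in range(NUM_SAMPLES)]

centre = mean(sample_means)   # should sit near the population mean of 1
spread = stdev(sample_means)  # should sit near 1 / sqrt(40)

# Under normality, about 68% of sample means fall within one spread of the centre.
share = sum(abs(m - centre) <= spread for m in sample_means) / NUM_SAMPLES

print(f"mean of sample means   = {centre:.3f}")
print(f"spread of sample means = {spread:.3f}")
print(f"share within 1 spread  = {share:.1%}")
```

Even though individual observations are heavily skewed, the sample means cluster symmetrically around the population mean.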
  66. Demonstration: http://onlinestatbook.com/stat_sim/sampling_dist/index.html
  67. Spread of sample means: sample means will almost never be the same as the true population mean! The sample mean will be most accurate when the sample size (n) is big and the spread of the population values (σ) is low. Standard error: measures the spread of the sample mean; sometimes called SE or sd(X̄).
  68. Probability of sample mean: the sample mean (X̄) is normal (if sample size > 30). Z-score = number of standard errors from the population mean. Example with μ = 2: probability that the mean of the sample < 3 = NORMSDIST(1) = 0.84.
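The standard error of the sample mean is σ / √n, and the Z-score of a sample mean counts standard errors rather than standard deviations. The slide's example only implies that 3 lies one standard error above μ = 2, so the σ = 6 and n = 36 used below are hypothetical values chosen to give a standard error of 1. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def prob_sample_mean_below(x_bar: float, mu: float, sigma: float, n: int) -> float:
    """P(sample mean < x_bar), assuming the sample mean is normally distributed."""
    standard_error = sigma / sqrt(n)  # spread of the sample mean
    z = (x_bar - mu) / standard_error # standard errors from the population mean
    return NormalDist().cdf(z)


# Hypothetical sigma and n, chosen so that SE = 6 / sqrt(36) = 1.
p = prob_sample_mean_below(3, mu=2, sigma=6, n=36)
print(f"P(sample mean < 3) = {p:.2f}")

# Quadrupling the sample size halves the standard error.
p_big = prob_sample_mean_below(3, mu=2, sigma=6, n=144)
print(f"with n = 144: {p_big:.4f}")
```

A larger sample tightens the distribution of the sample mean, so the same threshold of 3 becomes much less likely to be exceeded.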
  69. When to use ... Use the distribution of the sample mean when you are taking a sample or an average of a variable (which doesn't need to be normal) and want the probability that the sample mean or average takes on certain values (n > 30 if it is a sample). Use the distribution of the variable itself when the variable is normally distributed and you want the probability that the variable takes on certain values.
