Successfully reported this slideshow.
Upcoming SlideShare
×

# Descriptive Statistics

110 views

Published on

￼This presentation forms part of a free, online course on analytics

http://econ.anthonyjevans.com/courses/analytics/

Published in: Data & Analytics
• Full Name
Comment goes here.

Are you sure you want to Yes No
• Be the first to comment

### Descriptive Statistics

1. 1. Descriptive Statistics Krupnik Estate Agents Anthony J. Evans Professor of Economics, ESCP Europe www.anthonyjevans.com (cc) Anthony J. Evans 2019 | http://creativecommons.org/licenses/by-nc-sa/3.0/
2. 2. Weekly prices of studio apartments in West Hampstead 2 425 440 450 465 480 510 575 430 440 450 470 485 515 575 430 440 450 470 490 525 580 435 445 450 472 490 525 590 435 445 450 475 490 525 600 435 445 460 475 500 535 600 435 445 460 475 500 549 600 435 445 460480500 550 600 440 450 465 480 500 570 615 440 450465 480 510 570 615
3. 3. Weekly prices of studio apartments in West Hampstead (2) • The first task is to order the data set 3 425 430 430 435 435 435 435 435 440 440 440 440 440 445 445 445 445 445 450 450 450 450 450 450 450 460 460 460 465 465 465 470 470 472 475 475 475 480 480 480 480 485 490 490 490 500 500 500 500 510 510 515 525 525 525 535 549 550 570 570 575 575 580 590 600 600 600 600 615 615
4. 4. Foundations of descriptive statistics Measures of data Location Dispersion Description • A “typical” or average value • Used to summarise the distribution • The spread or variability of the data • Appreciate the differences in the data 4
5. 5. Measures of Location: Mean • The mean of a data set is the summation of all individual values, divided by the number of observations SamplePopulation 5Notice the use of Greek/upper case for populations and Latin/lower case for samples x = xi∑ n = 34,356 70 = 490.80 € µ = x∑ i N € x = x∑ i n
6. 6. Measures of Location: Median • The median of a data set is the value that divides the lower half of the distribution from the higher half • The median is the middle observation – i.e. the (n+1)/2th observation – In this case, 71/2 = 35.5th observation • If there are an even number of observations, take the mean of both middle values • Median = 475 6 425 430 430 435 435 435 435 435 440 440 440 440 440 445 445 445 445 445 450 450 450 450 450 450 450 460 460 460 465 465 465 470 470 472 475 475 475 480 480 480 480 485 490 490 490 500 500 500 500 510 510 515 525 525 525 535 549 550 570 570 575 575 580 590 600 600 600 600 615 615
7. 7. Measures of Location: Mode • The mode of a data set is the value that occurs with the greatest frequency. • If the data have exactly two modes, the data are bimodal • If the data have more than two modes, the data are multimodal • Mode = 450 7 425 430 430 435 435 435 435 435 440 440 440 440 440 445 445 445 445 445 450 450 450 450 450 450 450 460 460 460 465 465 465 470 470 472 475 475 475 480 480 480 480 485 490 490 490 500 500 500 500 510 510 515 525 525 525 535 549 550 570 570 575 575 580 590 600 600 600 600 615 615
8. 8. Measures of Location: Using Excel 8
9. 9. Measures of Location: Percentiles • The median is also known as the 50th percentile • The p th percentile of a data set is a value such that p percent of the items take on this value or less, and (100 - p) percent of the items take on this value or more – Arrange the data of ‘n’ items in ascending order – Compute index i, the position of the pth percentile – If i is not an integer, round up. The pth percentile is the value in the ith position. – If i is an integer, the pth percentile is the average of the values in positions i and i+1 • I is the position of the p percentile 9 € i = p 100 " # \$ % & 'n
10. 10. Measures of Location: 90th Percentile • i = (p/100)n = (90/100)*70 = 63 • It will be the 63rd value • Averaging the 63rd and 64th data values • 90th Percentile = 585 10 425 430 430 435 435 435 435 435 440 440 440 440 440 445 445 445 445 445 450 450 450 450 450 450 450 460 460 460 465 465 465 470 470 472 475 475 475 480 480 480 480 485 490 490 490 500 500 500 500 510 510 515 525 525 525 535 549 550 570 570 575 575 580 590 600 600 600 600 615 615
11. 11. Measures of Location: 25th Percentile • i = (p/100)n = (25/100)*70 = 17.5 • It will be the 18th value • 25th Percentile = 445 11 425 430 430 435 435 435 435 435 440 440 440 440 440 445 445 445 445 445 450 450 450 450 450 450 450 460 460 460 465 465 465 470 470 472 475 475 475 480 480 480 480 485 490 490 490 500 500 500 500 510 510 515 525 525 525 535 549 550 570 570 575 575 580 590 600 600 600 600 615 615
12. 12. GDP of UN member countries 12 \$267,015 \$18,171 0 50,000 100,000 150,000 200,000 250,000 300,000 Mean Median GDP On average, how many legs do English people have?
13. 13. Measures of Dispersion • in choosing supplier A or supplier B we should consider not only the average delivery time for each, but also the variability in delivery time for each A Mean = 0 B Mean = 0 Frequency 13
14. 14. Measures of Dispersion: Range • The range of a data set is the difference between the largest and smallest data values • Range = largest value - smallest value • Range = 615 - 425 = 190 14 425 430 430 435 435 435 435 435 440 440 440 440 440 445 445 445 445 445 450 450 450 450 450 450 450 460 460 460 465 465 465 470 470 472 475 475 475 480 480 480 480 485 490 490 490 500 500 500 500 510 510 515 525 525 525 535 549 550 570 570 575 575 580 590 600 600 600 600 615 615
15. 15. Measures of Dispersion: Interquartile range • The interquartile range of a data set is the difference between the first and third quartiles • Q1 = 25th percentile = 445 (from before) • Q3 = 75th percentile = 525 • Interquartile range = 525 - 445 = 80 15 425 430 430 435 435 435 435 435 440 440 440 440 440 445 445 445 445 445 450 450 450 450 450 450 450 460 460 460 465 465 465 470 470 472 475 475 475 480 480 480 480 485 490 490 490 500 500 500 500 510 510 515 525 525 525 535 549 550 570 570 575 575 580 590 600 600 600 600 615 615 € i = p 100 " # \$ % & 'n
16. 16. Measures of Dispersion: Variance • The variance is a measure of variability that utilizes all the data. • It is based on the difference between the value of each observation (xi) and the mean (x BAR for a sample, and m for a population) • The variance is the average of the squared differences between each data value and the mean. s 2 = (xi−x )2 ∑ n−1 σ2 = (xi −µ)2 ∑ N 16 SamplePopulation
17. 17. Measures of Dispersion: Standard Deviation • The standard deviation of a data set is the square root of the variance. • It is more easily comparable to the mean than the variance – standard deviation measures the spread about the mean using the original (not squared) scale • It ties into the Normal Distribution € s = s2 € σ = σ2 Sample Population 17 s = (xi −x )2 ∑ n−1
18. 18. What is the standard deviation? 18 i xi xi - x' (xi - x')2 1 425 -65.8 4329.64 2 440 -50.8 2580.64 3 450 -40.8 1664.64 4 465 -25.8 665.64 5 480 -10.8 116.64 6 510 19.2 368.64 7 575 84.2 7089.64 8 430 -60.8 3696.64 9 440 -50.8 2580.64 10 450 -40.8 1664.64 65 450 -40.8 1664.64 66 465 -25.8 665.64 67 480 -10.8 116.64 68 510 19.2 368.64 69 570 79.2 6272.64 70 615 124.2 15425.64 sum 206,735.20 /(n-1) 2,996.16 sqrt 54.74
19. 19. Measures of Dispersion: Examples • Variance • Standard Deviation s2 = (xi −x )2 ∑ n−1 =2,996 74.5429962 === ss 19
20. 20. Measures of Dispersion: z Score • The z - score is the standardised value • It denotes the number of standard deviations a data value xi is from the mean • A data value less than the sample mean will have a z-score less than zero • A data value greater than the sample mean will have a z -score greater than zero € zi = xi − x s 20
21. 21. Measures of Dispersion: z Score for smallest value • z-Score of Smallest Value (425) • Standardized Values for Apartment Rents: z x x s i= - = - = - 425 490 80 54 74 1 20 . . . 21 425 430 430 435 435 435 435 435 440 440 440 440 440 445 445 445 445 445 450 450 450 450 450 450 450 460 460 460 465 465 465 470 470 472 475 475 475 480 480 480 480 485 490 490 490 500 500 500 500 510 510 515 525 525 525 535 549 550 570 570 575 575 580 590 600 600 600 600 615 615
22. 22. Measures of Dispersion: Outliers • An outlier is an unusually small or unusually large value in a data set – It might be an incorrectly recorded data value – It might be a data value that was incorrectly included in the data set – It might be a correctly recorded data value that belongs in the data set! • A data value with a z-score less than -3 or greater than +3 might be considered an outlier 22
23. 23. Summary • There are two main ways to get feel for a set of numbers (a distribution) – location and dispersion • The mean and the standard deviation are the most frequent measures of location and dispersion but it’s important to understand the alternatives 23
24. 24. Solutions 24
25. 25. Measures of Location: Mean • The mean of a data set is the summation of all individual values, divided by the number of observations SamplePopulation 25Notice the use of Greek/upper case for populations and Latin/lower case for samples x = xi∑ n = 34,356 70 = 490.80 € µ = x∑ i N € x = x∑ i n
26. 26. Measures of Location: Median • The median of a data set is the value that divides the lower half of the distribution from the higher half • The median is the middle observation – i.e. the (n+1)/2th observation – In this case, 71/2 = 35.5th observation • If there are an even number of observations, take the mean of both middle values • Median = 475 26 425 430 430 435 435 435 435 435 440 440 440 440 440 445 445 445 445 445 450 450 450 450 450 450 450 460 460 460 465 465 465 470 470 472 475 475 475 480 480 480 480 485 490 490 490 500 500 500 500 510 510 515 525 525 525 535 549 550 570 570 575 575 580 590 600 600 600 600 615 615
27. 27. Measures of Location: Mode • The mode of a data set is the value that occurs with the greatest frequency. • If the data have exactly two modes, the data are bimodal • If the data have more than two modes, the data are multimodal • Mode = 450 27 425 430 430 435 435 435 435 435 440 440 440 440 440 445 445 445 445 445 450 450 450 450 450 450 450 460 460 460 465 465 465 470 470 472 475 475 475 480 480 480 480 485 490 490 490 500 500 500 500 510 510 515 525 525 525 535 549 550 570 570 575 575 580 590 600 600 600 600 615 615
28. 28. Measures of Location: 90th Percentile • i = (p/100)n = (90/100)*70 = 63 • It will be the 63rd value • Averaging the 63rd and 64th data values • 90th Percentile = 585 28 425 430 430 435 435 435 435 435 440 440 440 440 440 445 445 445 445 445 450 450 450 450 450 450 450 460 460 460 465 465 465 470 470 472 475 475 475 480 480 480 480 485 490 490 490 500 500 500 500 510 510 515 525 525 525 535 549 550 570 570 575 575 580 590 600 600 600 600 615 615
29. 29. Measures of Location: 25th Percentile • i = (p/100)n = (25/100)*70 = 17.5 • It will be the 18th value • 25th Percentile = 445 29 425 430 430 435 435 435 435 435 440 440 440 440 440 445 445 445 445 445 450 450 450 450 450 450 450 460 460 460 465 465 465 470 470 472 475 475 475 480 480 480 480 485 490 490 490 500 500 500 500 510 510 515 525 525 525 535 549 550 570 570 575 575 580 590 600 600 600 600 615 615
30. 30. Measures of Dispersion: Range • The range of a data set is the difference between the largest and smallest data values • Range = largest value - smallest value • Range = 615 - 425 = 190 30 425 430 430 435 435 435 435 435 440 440 440 440 440 445 445 445 445 445 450 450 450 450 450 450 450 460 460 460 465 465 465 470 470 472 475 475 475 480 480 480 480 485 490 490 490 500 500 500 500 510 510 515 525 525 525 535 549 550 570 570 575 575 580 590 600 600 600 600 615 615
31. 31. Measures of Dispersion: Interquartile range • The interquartile range of a data set is the difference between the first and third quartiles • Q1 = 25th percentile = 445 (from before) • Q3 = 75th percentile = 525 • Interquartile range = 525 - 445 = 80 31 425 430 430 435 435 435 435 435 440 440 440 440 440 445 445 445 445 445 450 450 450 450 450 450 450 460 460 460 465 465 465 470 470 472 475 475 475 480 480 480 480 485 490 490 490 500 500 500 500 510 510 515 525 525 525 535 549 550 570 570 575 575 580 590 600 600 600 600 615 615 € i = ( p 100 )n
32. 32. Measures of Dispersion: z Score for smallest value • z-Score of Smallest Value (425) • Standardized Values for Apartment Rents: z x x s i= - = - = - 425 490 80 54 74 1 20 . . . 32 425 430 430 435 435 435 435 435 440 440 440 440 440 445 445 445 445 445 450 450 450 450 450 450 450 460 460 460 465 465 465 470 470 472 475 475 475 480 480 480 480 485 490 490 490 500 500 500 500 510 510 515 525 525 525 535 549 550 570 570 575 575 580 590 600 600 600 600 615 615
33. 33. • This presentation forms part of a free, online course on analytics • http://econ.anthonyjevans.com/courses/analytics/ 33