Statistics with R
Published on

Praxis Weekend Business Analytics Program

Sessions 1 to 3


  1. 1. > x=11
     > print(x)
     [1] 11
     > x
     [1] 11
     > X
     Error: object 'X' not found
     > y<-7
     > y
     [1] 7
     > y<-9
     > y
     [1] 9
     > ls()
     [1] "x" "y"
     > rm(y)
     > y
     Error: object 'y' not found
     > y<-9
     > x.1<-14
     > x.1
     [1] 14
     > 1x<-22
     Error: unexpected symbol in "1x"
  2. 2. Entering data with c • The c function, used for small datasets, combines (concatenates) terms together. Example: we have a count of the number of typing mistakes in a Word document: 0 2 1 3 2 0 1 1. To enter this into an R session:
     > typo=c(0,2,1,3,2,0,1,1)
     > typo
     [1] 0 2 1 3 2 0 1 1
  3. 3. Learning Objectives • What is statistics? • Become aware of the varied applications of statistics in business. • Differentiate between descriptive and inferential statistics. • Identify types of variables.
  4. 4. Statistics in Business • Accounting — auditing and cost estimation • Economics — local, regional, national, and international economic performance • Finance — investments and portfolio management • Management — human resources, compensation, and quality management • Management Information Systems — performance of systems which gather, summarize, and disseminate information to various managerial levels • Marketing — market analysis and consumer research • International Business — market and demographic analysis
  5. 5. What is Statistics? • Science dealing with collection, analysis, interpretation and presentation of data (with a view to making inferences) • Branches of statistics: – Descriptive – graphical or numerical summaries of data – Inferential – making a decision based on data
  6. 6. What is Statistics? Statistics in business is the study of VARIATIONS
  7. 7. Population Versus Sample • Population — the whole – a collection of all persons, objects, or items under study • Census — gathering data from the entire population • Sample — gathering data on a subset of the population – Use information about the sample to infer about the population
  8. 8. Population Versus Sample
  9. 9. Population and Census Data
     Identifier  Color  MPG
     RD1         Red    12
     RD2         Red    10
     RD3         Red    13
     RD4         Red    10
     RD5         Red    13
     BL1         Blue   27
     BL2         Blue   24
     GR1         Green  35
     GR2         Green  35
     GY1         Gray   15
     GY2         Gray   18
     GY3         Gray   17
  10. 10. Sample and Sample Data
     Identifier  Color  MPG
     RD2         Red    10
     RD5         Red    13
     GR1         Green  35
     GY2         Gray   18
  11. 11. Population Versus Sample Select a random sample
  12. 12. Parameter vs. Statistic • Parameter - descriptive measure of the population – Usually represented by Greek letters: μ denotes population mean, σ² denotes population variance, σ denotes population standard deviation • Statistic - descriptive measure of a sample – Usually represented by Roman letters: x̄ denotes sample mean, s² denotes sample variance, s denotes sample standard deviation
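The parameter/statistic distinction can be sketched in R with the car data from the population and sample slides. This is an illustrative sketch only: the object names (`population`, `smp`, `mu`, `sigma2`, `xbar`, `s2`) and the seed are assumptions, not part of the original deck.

```r
# Population: the twelve cars from the "Population and Census Data" slide
population <- data.frame(
  id    = c("RD1", "RD2", "RD3", "RD4", "RD5", "BL1", "BL2",
            "GR1", "GR2", "GY1", "GY2", "GY3"),
  color = c(rep("Red", 5), rep("Blue", 2), rep("Green", 2), rep("Gray", 3)),
  mpg   = c(12, 10, 13, 10, 13, 27, 24, 35, 35, 15, 18, 17)
)

# Sample: the four cars from the "Sample and Sample Data" slide
smp <- population[population$id %in% c("RD2", "RD5", "GR1", "GY2"), ]

N      <- nrow(population)
mu     <- mean(population$mpg)               # parameter: population mean
sigma2 <- sum((population$mpg - mu)^2) / N   # parameter: variance divides by N
xbar   <- mean(smp$mpg)                      # statistic: sample mean
s2     <- var(smp$mpg)                       # statistic: var() divides by n - 1

# Drawing a fresh random sample of 4 cars (which rows appear depends on the seed)
set.seed(1)
random_smp <- population[sample(N, 4), ]
```

Note that R's `var()` always uses the n - 1 divisor, so the population variance is computed by hand here.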
  13. 13. Statistics in Business • Inferences about parameters made under conditions of uncertainty (which are always present in statistics) – Uncertainty can be caused by • Randomness in selection of a sample • Lack of knowledge about the source of the inferences • Change in conditions not accounted for
  14. 14. Variables and Data Variable : a characteristic of any entity being studied – is capable of taking on different values that can be used for analysis e.g. stock price, ROI, market share, age of worker, income of a family, total sales, advertising cost etc Measurement : is done when a standard process is used to assign numbers to particular characteristics of a variable – may be obvious or defined e.g. age is obvious but ROI or Labour productivity is defined The source of each measurement is called a Sampling unit Data : recorded measurements
  15. 15. Levels of Data Measurement What are 40 and 80? They may represent: • Weights of two objects being shipped • Ratings received in a consumer test by two different products • Football jersey numbers of a fullback and a centre forward. The appropriateness of a data analysis depends on the level of measurement of the data gathered
  16. 16. Levels of Data Measurement • Nominal — Qualitative data, typically numbers are used only to classify or categorize the attribute, however it is useful to retain original verbal descriptions of categories – 1 for “male” and 2 for “female” – Employee identification number – Religion, Geographic location, PIN code, Place of birth – Demographic questions in survey etc
  17. 17. Levels of Data Measurement • Ordinal - A variable is ordinal measurable if ranking or ordering is possible for values of the variable. – For example, a gold medal reflects superior performance to a silver or bronze medal in the Olympics. But can you say a gold and a bronze medal average out to a silver medal? – Preference scales are typically ordinal – how much do you like this cereal? Like it a lot, somewhat like it, neutral, somewhat dislike it, dislike it a lot.
  18. 18. Levels of Data Measurement • Interval - In interval measurement the distance between attributes does have meaning. – Numerical data typically fall into this category – For example, when measuring temperature (in Fahrenheit), the distance from 30-40 is same as the distance from 70-80. The interval between values is interpretable.
  19. 19. Levels of Data Measurement • Ratio — in ratio measurement there is always a reference point that is meaningful (either 0 for rates or 1 for ratios) – This means that you can construct a meaningful fraction (or ratio) with a ratio variable. – In applied social research most "count" variables are ratio, for example, the number of clients in past six months.
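The four levels of measurement map loosely onto R data types. The sketch below is one way to illustrate them; the variable names and the small example vectors are made up for illustration.

```r
# Nominal: categories with no order; counting is meaningful, averaging is not
color <- factor(c("Red", "Blue", "Green", "Red"))
table(color)

# Ordinal: ordered categories; ranking is meaningful
medal <- factor(c("bronze", "gold", "silver"),
                levels = c("bronze", "silver", "gold"), ordered = TRUE)
gold_beats_silver <- medal[2] > medal[3]   # comparison is allowed for ordered factors
# Averaging medals is not meaningful: mean(medal) returns NA with a warning

# Interval: differences are meaningful (temperatures in Fahrenheit)
temp_f <- c(30, 40, 70, 80)
gaps <- diff(temp_f)[c(1, 3)]              # 30 to 40 and 70 to 80 are the same distance

# Ratio: a meaningful zero, so fractions make sense (e.g. two weights)
weights <- c(40, 80)
ratio <- weights[2] / weights[1]           # the second object is twice as heavy
```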
  20. 20. Visualizing the data • Construct a frequency distribution – For both grouped and ungrouped data • Construct graphical summaries of qualitative data • Construct graphical summaries of quantitative data • Construct graphical summaries of two variables
  21. 21. Ungrouped vs. Grouped Data • Ungrouped data – have not been summarized in any way – are also called raw data • Grouped data – logical groupings of data exist • e.g. age ranges (20-29, 30-39, etc.) – have been organized into a frequency distribution
  22. 22. Example of Ungrouped Data 42 26 32 34 57 30 58 37 50 30 53 40 30 47 49 50 40 32 31 40 52 28 23 35 25 30 36 32 26 50 55 30 58 64 52 49 33 43 46 32 61 31 30 40 60 74 37 29 43 54 Ages of a sample of Managers from Urban Child Care Centres in US
  23. 23. Frequency Distribution • Frequency Distribution – summary of data presented in the form of class intervals and frequencies – Vary in shape and design – Constructed according to the individual researcher's preferences
  24. 24. Frequency Distribution • Steps in Frequency Distribution – Step 1 – Determine range of frequency distribution • Range is the difference between the highest and the lowest numbers – Step 2 – Determine the number of classes • Do not use too many, or too few classes – Step 3 – Determine the width of the class interval • Approx. class width can be calculated by dividing the range by the number of classes • Values fit into only one class
  25. 25. Frequency Distribution of Child Care Manager’s Ages
     Class Interval  Frequency
     20-under 30     6
     30-under 40     18
     40-under 50     11
     50-under 60     11
     60-under 70     3
     70-under 80     1
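The frequency distribution above can be reproduced in R from the raw ages on the ungrouped-data slide. This is a minimal sketch using base R's `cut()` and `table()`; the names `ages`, `classes` and `freq` are illustrative.

```r
# Ages of the 50 child care centre managers (from the ungrouped-data slide)
ages <- c(42, 26, 32, 34, 57, 30, 58, 37, 50, 30,
          53, 40, 30, 47, 49, 50, 40, 32, 31, 40,
          52, 28, 23, 35, 25, 30, 36, 32, 26, 50,
          55, 30, 58, 64, 52, 49, 33, 43, 46, 32,
          61, 31, 30, 40, 60, 74, 37, 29, 43, 54)

# right = FALSE gives classes of the form "20-under 30", i.e. [20, 30)
classes <- cut(ages, breaks = seq(20, 80, by = 10), right = FALSE)
freq <- table(classes)
freq
```

Because each class is closed on the left and open on the right, every value falls into exactly one class.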
  26. 26. Relative Frequency Relative frequency is the proportion of the total frequency that is in any given class interval in a frequency distribution.
     Class Interval  Frequency  Relative Frequency
     20-under 30     6          .12  (= 6/50)
     30-under 40     18         .36  (= 18/50)
     40-under 50     11         .22
     50-under 60     11         .22
     60-under 70     3          .06
     70-under 80     1          .02
     Total           50         1.00
  27. 27. Cumulative Frequency Cumulative frequency is a running total of frequencies through the classes of a frequency distribution.
     Class Interval  Frequency  Cumulative Frequency
     20-under 30     6          6
     30-under 40     18         24  (= 18 + 6)
     40-under 50     11         35  (= 11 + 24)
     50-under 60     11         46
     60-under 70     3          49
     70-under 80     1          50
     Total           50
  28. 28. Cumulative Relative Frequencies Cumulative relative frequency is a running total of the relative frequencies through the classes of a frequency distribution.
     Class Interval  Frequency  Relative Frequency  Cumulative Frequency  Cumulative Relative Frequency
     20-under 30     6          .12                 6                     .12
     30-under 40     18         .36                 24                    .48
     40-under 50     11         .22                 35                    .70
     50-under 60     11         .22                 46                    .92
     60-under 70     3          .06                 49                    .98
     70-under 80     1          .02                 50                    1.00
     Total           50         1.00
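The relative, cumulative and cumulative relative frequencies in the table above follow directly from the class frequencies. A short sketch in R, with illustrative object names:

```r
# Class frequencies from the child care managers' age distribution
freq <- c("20-under 30" = 6, "30-under 40" = 18, "40-under 50" = 11,
          "50-under 60" = 11, "60-under 70" = 3, "70-under 80" = 1)

rel_freq <- freq / sum(freq)   # proportion of the total in each class
cum_freq <- cumsum(freq)       # running total of frequencies
cum_rel  <- cumsum(rel_freq)   # running total of relative frequencies

data.frame(freq, rel_freq, cum_freq, cum_rel)
```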
  29. 29. Common Statistical Graphs – Quantitative Data • Histogram -- vertical bar chart of frequencies • Frequency Polygon -- line graph of frequencies • Ogive -- line graph of cumulative frequencies • Dot Plots -- each data value is plotted • Stem and Leaf Plot -- like a histogram, but shows individual data values. Useful for small data sets.
  30. 30. Histogram • A histogram is a graphical summary of a frequency distribution • Labeling x-axis with class endpoints and y-axis with frequencies, drawing a horizontal line between two class endpoints at each frequency value • The number and location of rectangles (bars) should be determined based on the sample size and the range of the data
  31. 31. Data Range 42 26 32 34 57 30 58 37 50 30 53 40 30 47 49 50 40 32 31 40 52 28 23 35 25 30 36 32 26 50 55 30 58 64 52 49 33 43 46 32 61 31 30 40 60 74 37 29 43 54
     Smallest = 23, Largest = 74
     Range = Largest - Smallest = 74 - 23 = 51
  32. 32. Number of Classes and Class Width • The number of classes should be between 5 and 15. – Fewer than 5 classes cause excessive summarization. – More than 15 classes leave too much detail. • Class Width – Divide the range by the number of classes for an approximate class width – Round up to a convenient number
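The class-width rule can be worked through for the age data. In the sketch below the choice of 6 classes, the rounded width of 10 and the starting point of 20 are assumptions chosen to match the class intervals used elsewhere in the deck.

```r
smallest <- 23
largest  <- 74
data_range <- largest - smallest          # range of the data

n_classes    <- 6                         # assumed; should be between 5 and 15
approx_width <- data_range / n_classes    # approximate class width
class_width  <- 10                        # rounded up to a convenient number

# Class endpoints, starting at a convenient value at or below the smallest age
breaks <- seq(20, by = class_width, length.out = n_classes + 1)
breaks
```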
  33. 33. Class midpoint or Class mark The midpoint of each class interval is called the class midpoint or the class mark.
  34. 34. Midpoints for Age Classes
     Class Interval  Frequency  Midpoint  Relative Frequency  Cumulative Frequency
     20-under 30     6          25        .12                 6
     30-under 40     18         35        .36                 24
     40-under 50     11         45        .22                 35
     50-under 60     11         55        .22                 46
     60-under 70     3          65        .06                 49
     70-under 80     1          75        .02                 50
     Total           50                   1.00
  36. 36. Histogram
     Class Interval  Frequency
     20-under 30     6
     30-under 40     18
     40-under 50     11
     50-under 60     11
     60-under 70     3
     70-under 80     1
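A histogram of these classes can be drawn with base R's `hist()`. This is a sketch under the assumption that the raw ages are available; the titles and axis labels are illustrative choices.

```r
ages <- c(42, 26, 32, 34, 57, 30, 58, 37, 50, 30,
          53, 40, 30, 47, 49, 50, 40, 32, 31, 40,
          52, 28, 23, 35, 25, 30, 36, 32, 26, 50,
          55, 30, 58, 64, 52, 49, 33, 43, 46, 32,
          61, 31, 30, 40, 60, 74, 37, 29, 43, 54)

# right = FALSE makes each class include its lower endpoint ("20-under 30")
h <- hist(ages, breaks = seq(20, 80, by = 10), right = FALSE,
          main = "Ages of child care managers",
          xlab = "Years", ylab = "Frequency")

h$counts   # class frequencies
h$mids     # class midpoints (class marks)

# A frequency polygon joins the class midpoints at their frequencies
lines(h$mids, h$counts, type = "b")
```

The object returned by `hist()` carries the counts and midpoints, so the same call also yields the class marks.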
  37. 37. Frequency Polygon. A graphical display of class frequencies [figure: x-axis Years, 0 to 80; y-axis Frequency; class frequencies 6, 18, 11, 11, 3, 1 for the classes 20-under 30 through 70-under 80]
  38. 38. Relative Frequency Ogive
     Class Interval  Cumulative Relative Frequency
     20-under 30     .12
     30-under 40     .48
     40-under 50     .70
     50-under 60     .92
     60-under 70     .98
     70-under 80     1.00
  39. 39. Stem and Leaf plot: Safety Examination Scores for Plant Trainees
     Raw Data
     86 77 91 60 55
     76 92 47 88 67
     23 59 72 75 83
     77 68 82 97 89
     81 75 74 39 67
     79 83 70 78 91
     68 49 56 94 81

     Stem  Leaf
     2     3
     3     9
     4     79
     5     569
     6     07788
     7     0245567789
     8     11233689
     9     11247
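The stem-and-leaf display above can be rebuilt in R by splitting each score into its tens digit (stem) and units digit (leaf). The sketch below does this by hand with `tapply()`; the names `scores`, `sorted` and `leaves` are illustrative. Base R also provides `stem()`, whose layout may differ from the slide.

```r
# Safety examination scores for the 35 plant trainees
scores <- c(86, 77, 91, 60, 55,
            76, 92, 47, 88, 67,
            23, 59, 72, 75, 83,
            77, 68, 82, 97, 89,
            81, 75, 74, 39, 67,
            79, 83, 70, 78, 91,
            68, 49, 56, 94, 81)

sorted <- sort(scores)
# Stem = tens digit, leaf = units digit; leaves are pasted together per stem
leaves <- tapply(sorted %% 10, sorted %/% 10, paste, collapse = "")
cat(paste(names(leaves), "|", leaves), sep = "\n")

# Built-in alternative (the exact layout is chosen by R)
stem(scores)
```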
  40. 40. Stem and Leaf plot: Construction [figure: the raw data and stem-and-leaf display from the previous slide, with callouts labelling the stems and the leaves]
  41. 41. Histogram vs. Stem and Leaf? • So, which one should you use? • A Stem and Leaf plot is useful for small data sets. It shows the values of the data points. • A histogram forgoes the individual values of the data in favour of the bigger picture of the distribution of the data • The purpose of these graphs is to summarize a set of data. As long as that need is met, either one is okay to use.
  42. 42. Common Statistical Graphs – Qualitative Data • Pie Chart -- proportional representation for categories of a whole • Bar Chart -- frequency or relative frequency of one or more categorical variables
  43. 43. Complaints by Amtrak Passengers
     COMPLAINT          NUMBER  PROPORTION  DEGREES
     Stations, etc.     28,000  .40         144.0
     Train Performance  14,700  .21         75.6
     Equipment          10,500  .15         54.0
     Personnel          9,800   .14         50.4
     Schedules, etc.    7,000   .10         36.0
     Total              70,000  1.00        360.0
  44. 44. Complaints by Amtrak Passengers
  45. 45. Second Quarter Truck Production in the U.S. (Hypothetical values)
     Company  2d Quarter Truck Production
     A        357,411
     B        354,936
     C        160,997
     D        34,099
     E        12,747
     Totals   920,190
  46. 46. Second Quarter U.S. Truck Production
  47. 47. Pie Chart Calculations for Company A
     Company  2d Quarter Truck Production  Proportion  Degrees
     A        357,411                      .388        140
     B        354,936                      .386        139
     C        160,997                      .175        63
     D        34,099                       .037        13
     E        12,747                       .014        5
     Totals   920,190                      1.000       360
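The pie chart calculations amount to dividing each company's production by the total and scaling the proportion to 360 degrees. A brief sketch in R, with illustrative object names and chart title:

```r
# Hypothetical second-quarter truck production, by company
production <- c(A = 357411, B = 354936, C = 160997, D = 34099, E = 12747)

proportion <- production / sum(production)   # share of the whole
degrees    <- proportion * 360               # size of each pie slice

round(proportion, 3)
round(degrees)

# Base R draws the chart directly from the raw values
pie(production, main = "Second Quarter U.S. Truck Production")
```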
  48. 48. Vertical Bar Graphs or Column Charts [figure: column chart with series Kolkata, Mumbai and Chennai; category axis 2010 to 2013; value axis 0 to 6]
  49. 49. Horizontal Bar Chart [figure: horizontal bar chart with series Kolkata, Mumbai and Chennai; category axis 2010 to 2013; value axis 0 to 6]
  50. 50. Pareto Chart A Pareto chart is a bar chart, sorted from the most frequent to the least frequent, overlaid with a cumulative line graph (like an ogive). These data present the most common types of defects. [figure: bars for Poor Wiring, Short in Coil, Defective Plug and Other; left axis Frequency, 0 to 100; right axis cumulative percentage, 0% to 100%]
  51. 51. Scatter Plot
     Registered Vehicles (1000's)  Gasoline Sales (1000's of Gallons)
     5                             60
     15                            120
     9                             90
     15                            140
     7                             60
  52. 52. Common Statistical Graphs – Comparing Two Variables • Scatter Plot -- type of display using Cartesian coordinates to display values for two variables for a set of data. – The data is displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis. – A scatter plot is also called a scatter chart, scatter diagram and scatter graph.
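A scatter plot of the registered vehicles and gasoline sales data can be drawn with base R's `plot()`. The vector names and the chart title below are illustrative.

```r
# Data from the scatter plot slide (both in thousands)
vehicles <- c(5, 15, 9, 15, 7)        # registered vehicles
gasoline <- c(60, 120, 90, 140, 60)   # gasoline sales, gallons

# One point per observation: vehicles on the horizontal axis, sales on the vertical
plot(vehicles, gasoline,
     xlab = "Registered Vehicles (1000's)",
     ylab = "Gasoline Sales (1000's of Gallons)",
     main = "Gasoline sales vs. registered vehicles")
```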
  53. 53. Measures of Central Tendency & Dispersion: Learning Objectives • Distinguish between measures of central tendency, measures of variability, measures of shape, and measures of association. • Understand the meanings of mean, median, mode, quartile, percentile, and range. • Compute mean, median, mode, percentile, quartile, range, variance, standard deviation, and mean absolute deviation on ungrouped data. • Differentiate between sample and population variance and standard deviation.
  54. 54. Measures of Central Tendency & Dispersion: Learning Objectives - continued • Understand the meaning of standard deviation as it is applied by using the empirical rule and Chebyshev’s theorem. • Compute the mean, median, standard deviation, and variance on grouped data. • Understand box and whisker plots, skewness, and kurtosis. • Compute a coefficient of correlation and interpret it.
  55. 55. Measures of Central Tendency: Ungrouped Data • Measures of central tendency yield information about “the centre, or middle part, of a group of numbers.” • Measures of central tendency do not focus on the span of the data set or how far values are from the middle numbers • Common Measures of Location – Mode – Median – Mean – Percentiles – Quartiles
  56. 56. Mode • Mode - the most frequently occurring value in a data set – Applicable to all levels of data measurement (nominal, ordinal, interval, and ratio) – Can be used to determine what categories occur most frequently – Sometimes, no mode exists (no duplicates) • Bimodal – In a tie for the most frequently occurring value, two modes are listed • Multimodal -- Data sets that contain more than two modes
  57. 57. Median • Median - middle value in an ordered array of numbers. – Half the data are above it, half the data are below it – Mathematically, it is the (n+1)/2 th ordered observation • For an array with an odd number of terms, the median is the middle number – n=11 => (n+1)/2 th = 12/2 th = 6th ordered observation • For an array with an even number of terms the median is the average of the middle two numbers – n=10 => (n+1)/2 th = 11/2 th = 5.5th = average of 5th and 6th ordered observation
  58. 58. Arithmetic Mean • Mean is the average of a group of numbers • Applicable for interval and ratio data • Not applicable for nominal or ordinal data • Affected by each value in the data set, including extreme values • Computed by summing all values in the data set and dividing the sum by the number of values in the data set
  59. 59. Demonstration Problem The number of U.S. cars in service by top car rental companies in a recent year according to Auto Rental News follows. Company / Number of Cars in Service Enterprise 643,000; Hertz 327,000; National/Alamo 233,000; Avis 204,000; Dollar/Thrifty 167,000; Budget 144,000; Advantage 20,000; U-Save 12,000; Payless 10,000; ACE 9,000; Fox 9,000; Rent-A-Wreck 7,000; Triangle 6,000 Compute the mode, the median, and the mean.
  60. 60. Demonstration Problem • Solution Mode: 9,000 (two companies with 9,000 cars in service) Median: With 13 different companies in this group, N = 13. The median is located at the (13 + 1)/2 = 7th position. Because the data are already ordered, the median is the 7th term, which is 20,000. Mean: μ = ∑x/N = 1,791,000/13 = 137,769.23
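The demonstration problem can be checked in R. Base R has `median()` and `mean()`, but no built-in function for the statistical mode, so the helper `stat_mode()` below is a hypothetical name written for this sketch.

```r
# Cars in service for the 13 rental companies, in the order listed on the slide
cars <- c(643000, 327000, 233000, 204000, 167000, 144000, 20000,
          12000, 10000, 9000, 9000, 7000, 6000)

# Mode: the value(s) that occur most often
stat_mode <- function(x) {
  ux <- unique(x)
  counts <- tabulate(match(x, ux))   # how often each distinct value occurs
  ux[counts == max(counts)]
}

stat_mode(cars)   # most frequent value
median(cars)      # 7th of the 13 ordered values
mean(cars)        # sum of the values divided by N
```

If no value repeats, this helper returns every value, which corresponds to the "no mode exists" case.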
  61. 61. Percentile • Percentile - measures of central tendency that divide a group of data into 100 parts • At least n% of the data lie at or below the nth percentile, and at most (100 - n)% of the data lie above the nth percentile • Example: the 90th percentile indicates that at least 90% of the data are equal to or less than it, and at most 10% of the data lie above it
  62. 62. Calculating Percentiles • To calculate the pth percentile, – Order the data from smallest to largest – Calculate i = N (p/100) – Determine the percentile • If i is a whole number, use the average of the ith and (i+1)th ordered observations • Otherwise, round i up to the next whole number and use that ordered observation
  63. 63. Quartiles • Quartile - measures of central tendency that divide a group of data into four subgroups • Q1: 25% of the data set is below the first quartile • Q2: 50% of the data set is below the second quartile • Q3: 75% of the data set is below the third quartile [figure: Q1, Q2 and Q3 divide the ordered data into four parts of 25% each]
  64. 64. Quartiles for Demonstration Problem For the cars in service data, n=13, so Q1: i = 13 (25/100) = 3.25, so use the 4th ordered observation Q1 = 9,000 Q3: i = 13 (75/100) = 9.75, so use the 10th ordered observation Q3 = 204,000
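The percentile steps, and the quartiles computed from them, can be sketched as a small R function. The name `percentile()` is hypothetical, and the function is only meant for 0 < p < 100.

```r
# pth percentile using the rule from the "Calculating Percentiles" slide
percentile <- function(x, p) {
  x <- sort(x)                 # order the data from smallest to largest
  i <- length(x) * p / 100     # location index i = N (p/100)
  if (i == floor(i)) {
    mean(x[c(i, i + 1)])       # whole number: average the ith and (i+1)th values
  } else {
    x[ceiling(i)]              # otherwise: round i up and take that value
  }
}

cars <- c(643000, 327000, 233000, 204000, 167000, 144000, 20000,
          12000, 10000, 9000, 9000, 7000, 6000)

q1 <- percentile(cars, 25)     # i = 3.25, so the 4th ordered observation
q3 <- percentile(cars, 75)     # i = 9.75, so the 10th ordered observation
```

R's built-in `quantile()` uses a different interpolation rule by default, so its answers can differ from this rule. Its `type = 2` option appears to correspond to the averaging rule used here, but that is worth confirming in the documentation.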
  65. 65. Which Measure Do I Use? • Which measure of central tendency is most appropriate? – In general, the mean is preferred, since it has nice mathematical properties, which we shall discuss later – The median and quartiles are resistant to outliers • Consider the following three datasets – 1, 2, 3 (median=2, mean=2) – 1, 2, 6 (median=2, mean=3) – 1, 2, 30 (median=2, mean=11) – All have median=2, but the mean is sensitive to the outliers • In general, if there are outliers, the median is preferred to the mean ……….. To continue
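The three small datasets from the last slide make the point about outliers easy to verify in R. The object names are illustrative.

```r
datasets <- list(c(1, 2, 3), c(1, 2, 6), c(1, 2, 30))

medians <- sapply(datasets, median)   # unchanged by the growing largest value
means   <- sapply(datasets, mean)     # pulled upward by the growing largest value

data.frame(median = medians, mean = means)
```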
