# Statistics with R

Sessions 1 to 3

### Statistics with R

1. 1. Creating, listing, and removing objects in an R session:

       > x=11
       > print(x)
       [1] 11
       > x
       [1] 11
       > X
       Error: object 'X' not found
       > y<-7
       > y
       [1] 7
       > y<-9
       > y
       [1] 9
       > ls()
       [1] "x" "y"
       > rm(y)
       > y
       Error: object 'y' not found
       > y<-9
       > x.1<-14
       > x.1
       [1] 14
       > 1x<-22
       Error: unexpected symbol in "1x"
2. 2. Entering data with c • c function for small datasets – combines or concatenates terms together • Example: we have a count of the number of typing mistakes in a Word document: 0 2 1 3 2 0 1 1 • To enter this into an R session:

       > typo=c(0,2,1,3,2,0,1,1)
       > typo
       [1] 0 2 1 3 2 0 1 1
3. 3. Learning Objectives • What is statistics? • Become aware of the varied applications of statistics in business. • Differentiate between descriptive and inferential statistics. • Identify types of variables.
4. 4. Statistics in Business • Accounting — auditing and cost estimation • Economics — local, regional, national, and international economic performance • Finance — investments and portfolio management • Management — human resources, compensation, and quality management • Management Information Systems — performance of systems which gather, summarize, and disseminate information to various managerial levels • Marketing — market analysis and consumer research • International Business — market and demographic analysis
5. 5. What is Statistics? • Science dealing with collection, analysis, interpretation and presentation of data (with a view to making inferences) • Branches of statistics: – Descriptive – graphical or numerical summaries of data – Inferential – making a decision based on data
6. 6. What is Statistics? Statistics in business is the study of VARIATIONS
7. 7. Population Versus Sample • Population — the whole – a collection of all persons, objects, or items under study • Census — gathering data from the entire population • Sample — gathering data on a subset of the population – Use information about the sample to infer about the population
8. 8. Population Versus Sample
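9. 9. Population and Census Data

    | Identifier | Color | MPG |
    |---|---|---|
    | RD1 | Red | 12 |
    | RD2 | Red | 10 |
    | RD3 | Red | 13 |
    | RD4 | Red | 10 |
    | RD5 | Red | 13 |
    | BL1 | Blue | 27 |
    | BL2 | Blue | 24 |
    | GR1 | Green | 35 |
    | GR2 | Green | 35 |
    | GY1 | Gray | 15 |
    | GY2 | Gray | 18 |
    | GY3 | Gray | 17 |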
10. 10. Sample and Sample Data

    | Identifier | Color | MPG |
    |---|---|---|
    | RD2 | Red | 10 |
    | RD5 | Red | 13 |
    | GR1 | Green | 35 |
    | GY2 | Gray | 18 |
11. 11. Population Versus Sample Select a random sample
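
Drawing a random sample from the census table can be sketched in R as follows. This is only an illustration: the data frame name `population`, the sample size of 4 (matching the sample slide), and the fixed seed are assumptions, and a different seed will select different cars.

```r
# Population from the census slide: identifier, color, and MPG of 12 cars.
population <- data.frame(
  id    = c("RD1", "RD2", "RD3", "RD4", "RD5", "BL1", "BL2",
            "GR1", "GR2", "GY1", "GY2", "GY3"),
  color = c(rep("Red", 5), rep("Blue", 2), rep("Green", 2), rep("Gray", 3)),
  mpg   = c(12, 10, 13, 10, 13, 27, 24, 35, 35, 15, 18, 17)
)

set.seed(1)  # assumption: fixed seed, used only to make the draw repeatable

# Simple random sample of 4 rows, drawn without replacement.
rows <- sample(nrow(population), size = 4)
car_sample <- population[rows, ]
print(car_sample)

# The sample mean is used to infer the population mean.
mean(car_sample$mpg)   # statistic computed from the sample
mean(population$mpg)   # parameter, known here only because we have a census
```
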
12. 12. Parameter vs. Statistic • Parameter: descriptive measure of the population – Usually represented by Greek letters: $\mu$ denotes the population mean, $\sigma^2$ denotes the population variance, $\sigma$ denotes the population standard deviation • Statistic: descriptive measure of a sample – Usually represented by Roman letters: $\bar{x}$ denotes the sample mean, $s^2$ denotes the sample variance, $s$ denotes the sample standard deviation
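
The distinction can be made concrete with the MPG values from the population and sample slides. The sketch below assumes the usual definitions (population variance divides by N, sample variance divides by n - 1); the variable names are illustrative. Note that R's `var()` and `sd()` always use the n - 1 divisor, so the population versions are computed by hand.

```r
mpg_population <- c(12, 10, 13, 10, 13, 27, 24, 35, 35, 15, 18, 17)
mpg_sample     <- c(10, 13, 35, 18)   # cars RD2, RD5, GR1, GY2

# Parameters: computed from the whole population (divide by N).
mu     <- mean(mpg_population)
sigma2 <- mean((mpg_population - mu)^2)
sigma  <- sqrt(sigma2)

# Statistics: computed from the sample (var() and sd() divide by n - 1).
x_bar <- mean(mpg_sample)
s2    <- var(mpg_sample)
s     <- sd(mpg_sample)

c(mu = mu, sigma2 = sigma2, sigma = sigma)
c(x_bar = x_bar, s2 = s2, s = s)
```
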
13. 13. Statistics in Business • Inferences about parameters made under conditions of uncertainty (which are always present in statistics) – Uncertainty can be caused by • Randomness in selection of a sample • Lack of knowledge about the source of the inferences • Change in conditions not accounted for
14. 14. Variables and Data • Variable: a characteristic of any entity being studied – is capable of taking on different values that can be used for analysis, e.g. stock price, ROI, market share, age of worker, income of a family, total sales, advertising cost, etc. • Measurement: done when a standard process is used to assign numbers to particular characteristics of a variable – may be obvious or defined, e.g. age is obvious, but ROI or labour productivity is defined • The source of each measurement is called a sampling unit • Data: recorded measurements
15. 15. Levels of Data Measurement • What are 40 and 80? They may represent: – Weights of two objects being shipped – Ratings received in a consumer test by two different products – Football jersey numbers of a fullback and a centre forward • Appropriateness of data analysis depends on the level of measurement of the data gathered
16. 16. Levels of Data Measurement • Nominal: qualitative data; numbers are typically used only to classify or categorize the attribute, although it is useful to retain the original verbal descriptions of the categories – 1 for “male” and 2 for “female” – Employee identification number – Religion, geographic location, PIN code, place of birth – Demographic questions in a survey, etc.
17. 17. Levels of Data Measurement • Ordinal: a variable is measured on an ordinal scale if its values can be ranked or ordered. – For example, a gold medal reflects superior performance to a silver or bronze medal in the Olympics. But can you say a gold and a bronze medal average out to a silver medal? – Preference scales are typically ordinal – how much do you like this cereal? Like it a lot, somewhat like it, neutral, somewhat dislike it, dislike it a lot.
18. 18. Levels of Data Measurement • Interval: in interval measurement the distance between attributes does have meaning. – Numerical data typically fall into this category – For example, when measuring temperature (in Fahrenheit), the distance from 30 to 40 is the same as the distance from 70 to 80. The interval between values is interpretable.
19. 19. Levels of Data Measurement • Ratio: in ratio measurement there is always a meaningful reference point, a true zero – This means that you can construct a meaningful fraction (or ratio) with a ratio variable. – In applied social research most "count" variables are ratio, for example, the number of clients in the past six months.
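
In R, the level of measurement is usually reflected in the type chosen for a variable. The sketch below is illustrative only (the example values are made up): an unordered `factor` for nominal data, an ordered `factor` for ordinal data, and plain numeric vectors for interval and ratio data.

```r
# Nominal: categories with no order; counting is the only meaningful summary.
gender <- factor(c("male", "female", "female", "male"))
table(gender)

# Ordinal: categories with a ranking, declared from lowest to highest.
scale_levels <- c("dislike it a lot", "somewhat dislike it", "neutral",
                  "somewhat like it", "like it a lot")
rating <- factor(c("neutral", "like it a lot", "somewhat dislike it"),
                 levels = scale_levels, ordered = TRUE)
rating[2] > rating[1]   # comparisons are allowed for ordered factors
max(rating)             # so are min() and max(); averaging is not meaningful

# Ratio: a true zero makes fractions meaningful (80 is twice 40 as a weight).
weights <- c(40, 80)
weights[2] / weights[1]
```
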
20. 20. Visualizing the data • Construct a frequency distribution – For both grouped and ungrouped data • Construct graphical summaries of qualitative data • Construct graphical summaries of quantitative data • Construct graphical summaries of two variables
21. 21. Ungrouped vs. Grouped Data • Ungrouped data – have not been summarized in any way – are also called raw data • Grouped data – logical groupings of data exist • e.g. age ranges (20-29, 30-39, etc.) – have been organized into a frequency distribution
22. 22. Example of Ungrouped Data: ages of a sample of managers from urban child care centres in the US • 42, 26, 32, 34, 57, 30, 58, 37, 50, 30, 53, 40, 30, 47, 49, 50, 40, 32, 31, 40, 52, 28, 23, 35, 25, 30, 36, 32, 26, 50, 55, 30, 58, 64, 52, 49, 33, 43, 46, 32, 61, 31, 30, 40, 60, 74, 37, 29, 43, 54
23. 23. Frequency Distribution • Frequency Distribution – summary of data presented in the form of class intervals and frequencies – Vary in shape and design – Constructed according to the individual researcher's preferences
24. 24. Frequency Distribution • Steps in Frequency Distribution – Step 1 – Determine range of frequency distribution • Range is the difference between the highest and the lowest numbers – Step 2 – Determine the number of classes • Do not use too many or too few classes – Step 3 – Determine the width of the class interval • Approx. class width can be calculated by dividing the range by the number of classes • Values fit into only one class
25. 25. Frequency Distribution of Child Care Managers' Ages

    | Class Interval | Frequency |
    |---|---|
    | 20-under 30 | 6 |
    | 30-under 40 | 18 |
    | 40-under 50 | 11 |
    | 50-under 60 | 11 |
    | 60-under 70 | 3 |
    | 70-under 80 | 1 |
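
The frequency distribution above can be reproduced from the raw ages with `cut()` and `table()`. This is a sketch rather than the deck's own code; the variable names and the "20-under 30" style labels are chosen here to match the slide, and `right = FALSE` makes each class include its lower endpoint and exclude its upper one.

```r
ages <- c(42, 26, 32, 34, 57, 30, 58, 37, 50, 30,
          53, 40, 30, 47, 49, 50, 40, 32, 31, 40,
          52, 28, 23, 35, 25, 30, 36, 32, 26, 50,
          55, 30, 58, 64, 52, 49, 33, 43, 46, 32,
          61, 31, 30, 40, 60, 74, 37, 29, 43, 54)

breaks <- seq(20, 80, by = 10)
labels <- paste0(head(breaks, -1), "-under ", breaks[-1])

# Assign each age to exactly one class: [20, 30), [30, 40), ...
age_class <- cut(ages, breaks = breaks, labels = labels, right = FALSE)

freq <- table(age_class)
print(freq)
```
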
26. 26. Relative Frequency: the proportion of the total frequency that is in any given class interval in a frequency distribution.

    | Class Interval | Frequency | Relative Frequency |
    |---|---|---|
    | 20-under 30 | 6 | 6/50 = .12 |
    | 30-under 40 | 18 | 18/50 = .36 |
    | 40-under 50 | 11 | .22 |
    | 50-under 60 | 11 | .22 |
    | 60-under 70 | 3 | .06 |
    | 70-under 80 | 1 | .02 |
    | Total | 50 | 1.00 |
27. 27. Cumulative Frequency: a running total of frequencies through the classes of a frequency distribution.

    | Class Interval | Frequency | Cumulative Frequency |
    |---|---|---|
    | 20-under 30 | 6 | 6 |
    | 30-under 40 | 18 | 18 + 6 = 24 |
    | 40-under 50 | 11 | 11 + 24 = 35 |
    | 50-under 60 | 11 | 46 |
    | 60-under 70 | 3 | 49 |
    | 70-under 80 | 1 | 50 |
    | Total | 50 | |
28. 28. Cumulative Relative Frequencies: cumulative relative frequency is a running total of the relative frequencies through the classes of a frequency distribution.

    | Class Interval | Frequency | Relative Frequency | Cumulative Frequency | Cumulative Relative Frequency |
    |---|---|---|---|---|
    | 20-under 30 | 6 | .12 | 6 | .12 |
    | 30-under 40 | 18 | .36 | 24 | .48 |
    | 40-under 50 | 11 | .22 | 35 | .70 |
    | 50-under 60 | 11 | .22 | 46 | .92 |
    | 60-under 70 | 3 | .06 | 49 | .98 |
    | 70-under 80 | 1 | .02 | 50 | 1.00 |
    | Total | 50 | 1.00 | | |
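
Starting from the class frequencies, the relative, cumulative, and cumulative relative columns follow from a division and `cumsum()`. A minimal sketch, with illustrative variable names:

```r
freq <- c("20-under 30" = 6, "30-under 40" = 18, "40-under 50" = 11,
          "50-under 60" = 11, "60-under 70" = 3, "70-under 80" = 1)

rel     <- freq / sum(freq)   # proportion of the total in each class
cum     <- cumsum(freq)       # running total of frequencies
cum_rel <- cumsum(rel)        # running total of relative frequencies

distribution <- data.frame(frequency = freq, relative = rel,
                           cumulative = cum, cumulative_relative = cum_rel)
print(distribution)
```
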
29. 29. Common Statistical Graphs – Quantitative Data • Histogram -- vertical bar chart of frequencies • Frequency Polygon -- line graph of frequencies • Ogive -- line graph of cumulative frequencies • Dot Plots -- each data value is plotted • Stem and Leaf Plot -- like a histogram, but shows individual data values. Useful for small data sets.
30. 30. Histogram • A histogram is a graphical summary of a frequency distribution • It is constructed by labeling the x-axis with class endpoints and the y-axis with frequencies, then drawing a horizontal line between the two class endpoints at each frequency value • The number and location of rectangles (bars) should be determined based on the sample size and the range of the data
31. 31. Data Range • Range = Largest - Smallest = 74 - 23 = 51 • Data: 42, 26, 32, 34, 57, 30, 58, 37, 50, 30, 53, 40, 30, 47, 49, 50, 40, 32, 31, 40, 52, 28, 23, 35, 25, 30, 36, 32, 26, 50, 55, 30, 58, 64, 52, 49, 33, 43, 46, 32, 61, 31, 30, 40, 60, 74, 37, 29, 43, 54
32. 32. Number of Classes and Class Width • The number of classes should be between 5 and 15. – Fewer than 5 classes cause excessive summarization. – More than 15 classes leave too much detail. • Class Width – Divide the range by the number of classes for an approximate class width – Round up to a convenient number
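
The range, class-width, and histogram steps can be sketched together in R. The choice of 6 classes starting at 20 is taken from the age frequency distribution earlier in the deck; everything else (variable names, titles) is illustrative.

```r
ages <- c(42, 26, 32, 34, 57, 30, 58, 37, 50, 30,
          53, 40, 30, 47, 49, 50, 40, 32, 31, 40,
          52, 28, 23, 35, 25, 30, 36, 32, 26, 50,
          55, 30, 58, 64, 52, 49, 33, 43, 46, 32,
          61, 31, 30, 40, 60, 74, 37, 29, 43, 54)

data_range   <- max(ages) - min(ages)    # 74 - 23 = 51
n_classes    <- 6                        # between 5 and 15
approx_width <- data_range / n_classes   # 8.5
class_width  <- 10                       # rounded up to a convenient number

breaks <- seq(20, 80, by = class_width)

# right = FALSE gives "20-under 30" style classes.
h <- hist(ages, breaks = breaks, right = FALSE,
          main = "Ages of child care managers", xlab = "Years")
h$counts   # class frequencies
h$mids     # class midpoints (class marks)
```
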
33. 33. Class midpoint or Class mark The midpoint of each class interval is called the class midpoint or the class mark.
34. 34. Midpoints for Age Classes

    | Class Interval | Frequency | Midpoint | Relative Frequency | Cumulative Frequency |
    |---|---|---|---|---|
    | 20-under 30 | 6 | 25 | .12 | 6 |
    | 30-under 40 | 18 | 35 | .36 | 24 |
    | 40-under 50 | 11 | 45 | .22 | 35 |
    | 50-under 60 | 11 | 55 | .22 | 46 |
    | 60-under 70 | 3 | 65 | .06 | 49 |
    | 70-under 80 | 1 | 75 | .02 | 50 |
    | Total | 50 | | 1.00 | |
35. 35. Midpoints for Age Classes (table repeated from the previous slide)
36. 36. Histogram [Figure: histogram of the frequency distribution below]

    | Class Interval | Frequency |
    |---|---|
    | 20-under 30 | 6 |
    | 30-under 40 | 18 |
    | 40-under 50 | 11 |
    | 50-under 60 | 11 |
    | 60-under 70 | 3 |
    | 70-under 80 | 1 |
37. 37. Frequency Polygon: a graphical display of class frequencies [Figure: frequency polygon; x-axis: Years (0 to 80), y-axis: Frequency]

    | Class Interval | Frequency |
    |---|---|
    | 20-under 30 | 6 |
    | 30-under 40 | 18 |
    | 40-under 50 | 11 |
    | 50-under 60 | 11 |
    | 60-under 70 | 3 |
    | 70-under 80 | 1 |
38. 38. Relative Frequency Ogive

    | Class Interval | Cumulative Relative Frequency |
    |---|---|
    | 20-under 30 | .12 |
    | 30-under 40 | .48 |
    | 40-under 50 | .70 |
    | 50-under 60 | .92 |
    | 60-under 70 | .98 |
    | 70-under 80 | 1.00 |
39. 39. Stem and Leaf plot: Safety Examination Scores for Plant Trainees • Raw data: 86, 77, 91, 60, 55, 76, 92, 47, 88, 67, 23, 59, 72, 75, 83, 77, 68, 82, 97, 89, 81, 75, 74, 39, 67, 79, 83, 70, 78, 91, 68, 49, 56, 94, 81

    | Stem | Leaf |
    |---|---|
    | 2 | 3 |
    | 3 | 9 |
    | 4 | 79 |
    | 5 | 569 |
    | 6 | 07788 |
    | 7 | 0245567789 |
    | 8 | 11233689 |
    | 9 | 11247 |
40. 40. Stem and Leaf plot: Construction [Figure: the raw data and stem and leaf plot from the previous slide, with callouts labelling stems and leaves]
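
R has a built-in `stem()` function, but its default grouping of stems may differ from the slide. The sketch below therefore also builds the display by hand, assuming the tens digit is the stem and the units digit is the leaf (which is what the table above shows, e.g. 86 has stem 8 and leaf 6).

```r
scores <- c(86, 77, 91, 60, 55, 76, 92, 47, 88, 67,
            23, 59, 72, 75, 83, 77, 68, 82, 97, 89,
            81, 75, 74, 39, 67, 79, 83, 70, 78, 91,
            68, 49, 56, 94, 81)

# Built-in display; the stems chosen by R may not match the slide exactly.
stem(scores)

# Manual construction: stem = tens digit, leaf = units digit.
sorted <- sort(scores)
leaves <- split(sorted %% 10, sorted %/% 10)
for (st in names(leaves)) {
  cat(st, "|", paste(leaves[[st]], collapse = ""), "\n")
}
```
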
41. 41. Histogram vs. Stem and Leaf? • So, which one should you use? • A Stem and Leaf plot is useful for small data sets. It shows the values of the data points. • A histogram forgoes the individual values of the data in favour of the bigger picture of the distribution of the data • The purpose of these graphs is to summarize a set of data. As long as that need is met, either one is okay to use.
42. 42. Common Statistical Graphs – Qualitative Data • Pie Chart -- proportional representation for categories of a whole • Bar Chart -- frequency or relative frequency of one or more categorical variables
43. 43. Complaints by Amtrak Passengers

    | Complaint | Number | Proportion | Degrees |
    |---|---|---|---|
    | Stations, etc. | 28,000 | .40 | 144.0 |
    | Train Performance | 14,700 | .21 | 75.6 |
    | Equipment | 10,500 | .15 | 54.0 |
    | Personnel | 9,800 | .14 | 50.4 |
    | Schedules, etc. | 7,000 | .10 | 36.0 |
    | Total | 70,000 | 1.00 | 360.0 |
44. 44. Complaints by Amtrak Passengers
45. 45. Second Quarter U.S. Truck Production (hypothetical values)

    | Company | 2d Quarter Truck Production |
    |---|---|
    | A | 357,411 |
    | B | 354,936 |
    | C | 160,997 |
    | D | 34,099 |
    | E | 12,747 |
    | Totals | 920,190 |
46. 46. Second Quarter U.S. Truck Production
47. 47. Pie Chart Calculations for Company A

    | Company | 2d Quarter Truck Production | Proportion | Degrees |
    |---|---|---|---|
    | A | 357,411 | .388 | 140 |
    | B | 354,936 | .386 | 139 |
    | C | 160,997 | .175 | 63 |
    | D | 34,099 | .037 | 13 |
    | E | 12,747 | .014 | 5 |
    | Totals | 920,190 | 1.000 | 360 |
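
The pie chart calculations amount to dividing each company's production by the total and multiplying the proportion by 360. A minimal sketch using the hypothetical truck figures above (variable names are illustrative):

```r
production <- c(A = 357411, B = 354936, C = 160997, D = 34099, E = 12747)

proportion <- production / sum(production)   # share of the whole
degrees    <- proportion * 360               # angle of each slice

print(data.frame(production,
                 proportion = round(proportion, 3),
                 degrees    = round(degrees)))

# pie() works out the angles itself from the raw values.
pie(production, main = "Second Quarter U.S. Truck Production")
```
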
48. 48. Vertical Bar Graphs or Column Charts [Figure: column chart for Kolkata, Mumbai and Chennai, 2010 to 2013; value axis from 0 to 6]
49. 49. Horizontal Bar Chart [Figure: horizontal bar chart for Kolkata, Mumbai and Chennai, 2010 to 2013; value axis from 0 to 6]
50. 50. Pareto Chart • A Pareto chart is a bar chart, sorted from the most frequent to the least frequent, overlaid with a cumulative line graph (like an ogive). • These data present the most common types of defects. [Figure: Pareto chart of defect types (Poor Wiring, Short in Coil, Defective Plug, Other); left axis: Frequency (0 to 100), right axis: cumulative percentage (0% to 100%)]
51. 51. Scatter Plot

    | Registered Vehicles (1000's) | Gasoline Sales (1000's of Gallons) |
    |---|---|
    | 5 | 60 |
    | 15 | 120 |
    | 9 | 90 |
    | 15 | 140 |
    | 7 | 60 |
52. 52. Common Statistical Graphs – Comparing Two Variables • Scatter Plot -- type of display using Cartesian coordinates to display values for two variables for a set of data. – The data is displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis. – A scatter plot is also called a scatter chart, scatter diagram and scatter graph.
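
A scatter plot of the registered-vehicles and gasoline-sales data can be drawn with base R's `plot()`. This is a sketch; the variable names, title, and plotting symbol are illustrative choices.

```r
vehicles <- c(5, 15, 9, 15, 7)         # registered vehicles (1000's)
gasoline <- c(60, 120, 90, 140, 60)    # gasoline sales (1000's of gallons)

# Each point: x position from one variable, y position from the other.
plot(vehicles, gasoline,
     xlab = "Registered Vehicles (1000's)",
     ylab = "Gasoline Sales (1000's of Gallons)",
     main = "Gasoline sales vs. registered vehicles",
     pch  = 19)
```
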
53. 53. Measures of Central Tendency & Dispersion: Learning Objectives • Distinguish between measures of central tendency, measures of variability, measures of shape, and measures of association. • Understand the meanings of mean, median, mode, quartile, percentile, and range. • Compute mean, median, mode, percentile, quartile, range, variance, standard deviation, and mean absolute deviation on ungrouped data. • Differentiate between sample and population variance and standard deviation.
54. 54. Measures of Central Tendency & Dispersion: Learning Objectives - continued • Understand the meaning of standard deviation as it is applied by using the empirical rule and Chebyshev’s theorem. • Compute the mean, median, standard deviation, and variance on grouped data. • Understand box and whisker plots, skewness, and kurtosis. • Compute a coefficient of correlation and interpret it.
55. 55. Measures of Central Tendency: Ungrouped Data • Measures of central tendency yield information about “the centre, or middle part, of a group of numbers.” • Measures of central tendency do not focus on the span of the data set or how far values are from the middle numbers • Common Measures of Location – Mode – Median – Mean – Percentiles – Quartiles
56. 56. Mode • Mode - the most frequently occurring value in a data set – Applicable to all levels of data measurement (nominal, ordinal, interval, and ratio) – Can be used to determine what categories occur most frequently – Sometimes, no mode exists (no duplicates) • Bimodal – In a tie for the most frequently occurring value, two modes are listed • Multimodal -- Data sets that contain more than two modes
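
Base R has no function for the statistical mode (its `mode()` reports an object's storage type), so a small helper is needed. The function name `stat_modes` below is hypothetical; it returns every value tied for the highest frequency and an empty vector when no value repeats.

```r
# Hypothetical helper: all values that occur most often in x.
stat_modes <- function(x) {
  values <- unique(x)
  counts <- tabulate(match(x, values))   # frequency of each distinct value
  if (max(counts) == 1) {
    return(values[0])                    # no duplicates: no mode exists
  }
  values[counts == max(counts)]
}

typo <- c(0, 2, 1, 3, 2, 0, 1, 1)
stat_modes(typo)                                       # one mode
stat_modes(c("red", "blue", "red", "blue", "green"))   # bimodal, nominal data
stat_modes(c(4, 8, 15))                                # no mode
```
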
57. 57. Median • Median - middle value in an ordered array of numbers. – Half the data are above it, half the data are below it – Mathematically, it is the (n+1)/2 th ordered observation • For an array with an odd number of terms, the median is the middle number – n=11 => (n+1)/2 th = 12/2 th = 6th ordered observation • For an array with an even number of terms the median is the average of the middle two numbers – n=10 => (n+1)/2 th = 11/2 th = 5.5th = average of 5th and 6th ordered observation
58. 58. Arithmetic Mean • Mean is the average of a group of numbers • Applicable for interval and ratio data • Not applicable for nominal or ordinal data • Affected by each value in the data set, including extreme values • Computed by summing all values in the data set and dividing the sum by the number of values in the data set
59. 59. Demonstration Problem The number of U.S. cars in service by top car rental companies in a recent year according to Auto Rental News follows. Company / Number of Cars in Service Enterprise 643,000; Hertz 327,000; National/Alamo 233,000; Avis 204,000; Dollar/Thrifty 167,000; Budget 144,000; Advantage 20,000; U-Save 12,000; Payless 10,000; ACE 9,000; Fox 9,000; Rent-A-Wreck 7,000; Triangle 6,000 Compute the mode, the median, and the mean.
60. 60. Demonstration Problem: Solution • Mode: 9,000 (two companies with 9,000 cars in service) • Median: With 13 different companies in this group, N = 13. The median is located at the (13 + 1)/2 = 7th position. Because the data are already ordered, the median is the 7th term, which is 20,000. • Mean: μ = ∑x/N = 1,791,000/13 = 137,769.23
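
The solution can be checked in R with the cars-in-service figures from the problem statement. The vector name `cars_in_service` is an illustrative choice (it also avoids masking R's built-in `cars` dataset), and the mode is found from a frequency table because base R has no statistical mode function.

```r
cars_in_service <- c(643000, 327000, 233000, 204000, 167000, 144000, 20000,
                     12000, 10000, 9000, 9000, 7000, 6000)

# Mode: the value(s) with the highest count in the frequency table.
counts     <- table(cars_in_service)
mode_value <- as.numeric(names(counts)[counts == max(counts)])

median_value <- median(cars_in_service)   # 7th of 13 ordered values
mean_value   <- mean(cars_in_service)     # sum of values / 13

mode_value
median_value
round(mean_value, 2)
```
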
61. 61. Percentile • Percentile - measures of central tendency that divide a group of data into 100 parts • At least n% of the data lie at or below the nth percentile, and at most (100 - n)% of the data lie above the nth percentile • Example: the 90th percentile indicates that at least 90% of the data are equal to or less than it, and at most 10% of the data lie above it
62. 62. Calculating Percentiles • To calculate the pth percentile, – Order the data – Calculate i = N (p/100) – Determine the percentile • If i is a whole number, then use the average of the ith and (i+1)th ordered observation • Otherwise, round i up to the next whole number and use that ordered observation
63. 63. Quartiles • Quartile - measures of central tendency that divide a group of data into four subgroups • Q1: 25% of the data set is below the first quartile • Q2: 50% of the data set is below the second quartile • Q3: 75% of the data set is below the third quartile [Figure: the data split by Q1, Q2 and Q3 into four parts of 25% each]
64. 64. Quartiles for Demonstration Problem • For the cars in service data, n = 13 • Q1: i = 13 (25/100) = 3.25, so use the 4th ordered observation: Q1 = 9,000 • Q3: i = 13 (75/100) = 9.75, so use the 10th ordered observation: Q3 = 204,000
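
The percentile rule and the quartile results above can be sketched as a small R function. The name `percentile` is hypothetical, and the function only handles 0 < p < 100. R's built-in `quantile()` supports several definitions; `type = 2` is the one that averages at whole-number positions and otherwise rounds the position up, so it is shown alongside for comparison.

```r
# Hypothetical helper implementing the rule from the "Calculating Percentiles" slide.
percentile <- function(x, p) {
  stopifnot(p > 0, p < 100)
  x <- sort(x)                    # order the data
  i <- length(x) * p / 100        # i = N (p / 100)
  if (i == floor(i)) {
    mean(x[c(i, i + 1)])          # whole number: average ith and (i+1)th
  } else {
    x[ceiling(i)]                 # otherwise: round i up, take that value
  }
}

cars_in_service <- c(643000, 327000, 233000, 204000, 167000, 144000, 20000,
                     12000, 10000, 9000, 9000, 7000, 6000)

q1 <- percentile(cars_in_service, 25)   # i = 3.25, 4th ordered value
q3 <- percentile(cars_in_service, 75)   # i = 9.75, 10th ordered value
c(Q1 = q1, Q3 = q3)

# Built-in equivalent under the same definition.
quantile(cars_in_service, c(0.25, 0.75), type = 2)
```
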
65. 65. Which Measure Do I Use? • Which measure of central tendency is most appropriate? – In general, the mean is preferred, since it has nice mathematical properties that we shall discuss later – The median and quartiles are resistant to outliers • Consider the following three datasets – 1, 2, 3 (median=2, mean=2) – 1, 2, 6 (median=2, mean=3) – 1, 2, 30 (median=2, mean=11) – All have median=2, but the mean is sensitive to the outliers • In general, if there are outliers, the median is preferred to the mean
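
The three small datasets make the point easy to verify in R. A minimal sketch (the object names are illustrative):

```r
datasets <- list(c(1, 2, 3), c(1, 2, 6), c(1, 2, 30))

comparison <- data.frame(
  data   = sapply(datasets, paste, collapse = ", "),
  median = sapply(datasets, median),   # unchanged by the extreme value
  mean   = sapply(datasets, mean)      # pulled towards the extreme value
)
print(comparison)
```
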