- 1. Matthews Lazaro MSc Biostatistics DESCRIPTIVE STATISTICS KAMUZU COLLEGE OF NURSING
- 2. Basic Definitions Statistics is the science that deals with the collection, classification, analysis, interpretation and presentation of numerical facts or data. Data Collection Sources of data are many; the clinical area is one, where measurements from patients can serve as a data source. Variables that could be measured include the length of stay in the ward, the age of patients, types of diseases or conditions, distance travelled to the health facility, etc.
- 3. Example The following data could be collected from under- five children ward on the length of stay by patients ; 2 days, Brown; 7 days Black, 0.5 days
- 4. Sample and Population Symbols As we progress in this course there will be different symbols that represent the same thing. The only difference is that one comes from a sample and one comes from a population.
- 5. Symbols under this topic Sample mean: $\bar{x}$; Sample variance: $s^2$; Sample standard deviation: $s$; Population mean: $\mu$; Population variance: $\sigma^2$; Population standard deviation: $\sigma$
- 6. Classification Normally when data is collected, it is raw i.e. it is not processed. For example the data collected on length of stay in the under-five ward is raw data. One can present this data in groups called classes e.g. 0 - 5 days, 6-10 days, 11-15 days etc Each class will have corresponding frequencies Data presented in classes and corresponding frequencies is called frequency distribution.
- 7. Example

  | No. of days in ward (Class) | No. of patients (Frequency) |
  |---|---|
  | 0-5 | 1 |
  | 6-10 | 8 |
  | 11-15 | 15 |
  | 16-20 | 9 |
  | 21-25 | 5 |
  | 26-30 | 2 |
  | Total | 40 |
- 8. This data needs to be analyzed and presented in a form that could easily be understood by most people who may not know the intricacies of data analysis
- 9. Interpretation Data analysis and interpretation is the process of assigning meaning to the collected information and determining the conclusions, significance, and implications of the findings. In a situation where there has been an intervention, the purpose of the data analysis and interpretation phase is to transform the data collected into credible evidence about the performance of say an intervention.
- 10. For the frequency distribution above, the analysis and interpretation would involve measures of central tendency such as the mean, measures of spread such as the standard deviation, etc.
- 11. Presenting data in diagrams and charts Quantitative data is usually presented in figures and tables. (a) Bar Chart Used for discrete data. The categories on the x-axis are not linked. Table 1 shows hypothetical colours of eyes for patients in a hospital. Table 1: Frequency distribution of eye colour

  | Colour of eyes | No. of patients |
  |---|---|
  | Black | 11 |
  | White | 3 |
  | Red | 14 |
  | Brown | 25 |
  | Blue | 5 |
- 12. Figure 1
- 13. (b) Pie Chart A pie chart (or a circle chart) is a circular statistical graphic, which is divided into sectors to illustrate numerical proportion. The Pie Chart may be used for both continuous as well as discrete data.
- 14. Figure 2
- 15. (c) Histogram A Histogram is a graphical display of data using bars of different heights. It is similar to a Bar Chart, except that a Histogram is used to display continuous data and hence the bars touch each other. A histogram is a very important chart and is used in many situations in statistics, hence details of its construction are discussed in later sections, but basically a histogram looks as in Figure 3
- 16. Figure 3
- 17. Types of data Data refers to the information that has been collected from an experiment or a survey/research, or some historical record. Collected statistical data falls into one of two categories, discrete data or continuous data Discrete data is a set of data values which occupies only whole number values, often a count or score Example; number of patients admitted in a ward etc
- 18. Continuous data is any data that has infinite values with connected data points, often a measurement. Continuous data will occupy both whole number as well as fractional parts. Examples of continuous data include; height of a person (e.g. 1.72m; 1 is the whole number part while 0.72 is the fractional part), baby birth weight, distance covered in a race etc.
- 19. Data that is collected may be presented raw or grouped As an example, 100 birth weights for babies born at a clinic in Chiradzulu were presented raw as follows; 3.1 3.3 1.3 2.9 2.2 3.4 4.1 5.1 4.9 4.0 5.2 1.8 2.1 3.2 2.2 3.3 2.4 3.4 2.5 3.1 2.6 3.2 2.7 4.0 3.3 2.8 4.1 1.1 2.9 3.5 4.2 1.9 3.6 3.0 2.1 2.2 3.8 2.3 3.4 4.6 4.7 3.4 3.5 3.7 3.8 2.7 2.9 2.8 3.1 3.3 3.4 2.6 3.5 4.8 4.6 4.3 2.6 3.2 2.7 4.0 3.3 2.8 4.1 1.1 2.9 3.5 4.2 1.9 3.6 3.0 2.1 2.2 3.1 3.3 1.3 2.9 2.2 3.4 4.1 5.1 4.9 4.0 5.2 1.8 2.1 3.2 2.2 3.3
- 20. We can organize this data into five classes as shown in Table 1;

  | Class | Frequency |
  |---|---|
  | 1.1-2.0 | 9 |
  | 2.1-3.0 | 33 |
  | 3.1-4.0 | 38 |
  | 4.1-5.0 | 16 |
  | 5.1-6.0 | 4 |
  | Total | 100 |
- 21. Although the baby weights are presented to one place of decimal, it is possible that some of the weights were accurate to two places of decimal Suppose a baby’s weight were 3.06kg in which class would we place that weight? It would not be in the class 2.1 – 3.0 because 3.06 is larger than 3.0. It would also not be in the class 3.1 – 4.0 because 3.06 is less than 3.1
- 22. This therefore means that the classes above have gaps between them, so the weights of many babies would go unrecorded. Classes stated with such gaps are given by their Class Limits. To eliminate the gaps between the classes, we introduce what are called Class Boundaries. We first identify the gap between consecutive classes in the Class Limits. In the case above, the gaps are 0.1 each, i.e. from 3.0 in the second class to 3.1 in the third class, the difference is 0.1
- 23. If you divide this gap by 2 and use that to stretch each class you end up with class boundaries For example 0.1/2=0.05 Then the class 2.1-3.0 will be stretched by 0.05 resulting into 2.05-3.05 The next class will be 3.05-4.05 and so on
- 24. Table

  | Class Boundaries | Frequency |
  |---|---|
  | 1.05-2.05 | 9 |
  | 2.05-3.05 | 33 |
  | 3.05-4.05 | 38 |
  | 4.05-5.05 | 16 |
  | 5.05-6.05 | 4 |
  | Total | 100 |
- 25. The value that is at the centre of the Class Boundary is called the Class Mid-point such that; $$\text{Class Mid-point} = \frac{\text{Upper Class Boundary} + \text{Lower Class Boundary}}{2}$$
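The stretching of class limits into class boundaries, and the mid-point formula, can be sketched in Python as below. This is only an illustration: the function name `class_boundaries` and the rounding to two decimal places are assumptions made here, not part of the slides.

```python
def class_boundaries(limits, decimals=2):
    """Turn class limits (with gaps) into class boundaries and mid-points."""
    # Gap between the upper limit of one class and the lower limit of the next.
    gap = limits[1][0] - limits[0][1]
    half_gap = gap / 2
    rows = []
    for lower, upper in limits:
        lo = round(lower - half_gap, decimals)   # lower class boundary
        hi = round(upper + half_gap, decimals)   # upper class boundary
        mid = round((lo + hi) / 2, decimals)     # class mid-point
        rows.append((lo, hi, mid))
    return rows


# Class limits of the birth-weight table.
limits = [(1.1, 2.0), (2.1, 3.0), (3.1, 4.0), (4.1, 5.0), (5.1, 6.0)]
boundaries = class_boundaries(limits)
for lo, hi, mid in boundaries:
    print(f"{lo:.2f}-{hi:.2f}  mid-point {mid:.2f}")
```

Rounding is used only to hide floating-point noise; the arithmetic is the same halving of the 0.1 gap described in the slides.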
- 26. Descriptive Statistics Descriptive statistics are numbers or data that are used to summarize and describe data. Descriptive statistics tend to summarize a sample in order to get an idea about the population The main features of the sample are also the main features of a population.
- 27. Measures of Central Tendency A measure of central tendency is a value used to represent the typical or “average” value in a data set. There are 3 values that are considered measures of the center. 1. Mean 2. Median 3. Mode
- 28. Measures of Central Tendency for raw data Suppose you are weighing babies born at your clinic somewhere in Malawi, and the baby weights (in kg) of the first 10 babies were as follows: 2.7, 3.4, 3.0, 4.1, 5.2, 1.9, 2.3, 3.0, 3.3, 3.0 What single figure could represent the baby weights at this clinic? Let's see how different measures of central tendency are computed.
- 29. The mode The mode is the data value or datum (or value) which appears the largest number of times in the set or the most frequently occurring figure in the set If no data value is repeated, we say there is no mode. Using the following data set; 2.7kg, 3.4kg, 3.0kg, 4.1kg, 5.2kg, 1.9kg, 2.3kg, 3.0kg, 3.3kg, 3.0kg. The mode is 3.0kg (highest frequency)
- 30. The Median The median is defined as the middle figure after the data set is ranked or placed in order of magnitude. Example 22, 29, 35, 24, 26, 15, 28, 36, 45, 21, 33, 5, 46, 21, 19, 41, 5, 84, 58, 63, 5, 23 Find the median. Solution Rank the data in ascending order 5, 5, 5, 15, 19, 21, 21, 22, 23, 24, 26, 28, 29, 33, 35, 36, 41, 45, 46, 58, 63, 84
- 31. Then pick the two middle numbers (because the number of values is even) 5, 5, 5, 15, 19, 21, 21, 22, 23, 24, 26, 28, 29, 33, 35, 36, 41, 45, 46, 58, 63, 84 The two middle figures are 26 and 28. The average of these two figures is the median i.e. (26+28)/2 = 27 is the median.
- 32. The Arithmetic Mean The Arithmetic Mean is the sum of all data values divided by the number of values in the data set. The mean of a sample data set is denoted by $\bar{x}$. The mean of a population data set is denoted by $\mu$.
- 33. Mean is given by $$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$ Where n is the number of observations and i runs from 1 to n. Example Use the following data set to compute a sample mean: 1.65kg, 3.3kg, 4.1kg, 3.0kg, 3.1kg, 2.9kg, 2.8kg, 3.2kg, 3.0kg, 3.0kg
- 34. $$\bar{x} = \frac{1.65 + 3.3 + 4.1 + 3 + 3.1 + 2.9 + 2.8 + 3.2 + 3 + 3}{10} = 3.005 \text{ kg}$$
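As a quick check of the three raw-data examples above, the sketch below uses Python's standard `statistics` module. The variable names are chosen here for illustration only.

```python
import statistics

# Data sets copied from the mode, median and mean examples.
weights_for_mode = [2.7, 3.4, 3.0, 4.1, 5.2, 1.9, 2.3, 3.0, 3.3, 3.0]
values_for_median = [22, 29, 35, 24, 26, 15, 28, 36, 45, 21, 33,
                     5, 46, 21, 19, 41, 5, 84, 58, 63, 5, 23]
weights_for_mean = [1.65, 3.3, 4.1, 3.0, 3.1, 2.9, 2.8, 3.2, 3.0, 3.0]

mode_value = statistics.mode(weights_for_mode)        # most frequent value
median_value = statistics.median(values_for_median)   # average of the two middle values
mean_value = statistics.mean(weights_for_mean)        # sum divided by count

print(mode_value, median_value, round(mean_value, 3))
```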
- 35. Measures of Central Tendency for grouped data The Mode When data is presented in a frequency distribution, the mode is not found by inspection. The mode for grouped data may be found by using two methods: (a) Graphically (b) analytical (use of a formula) Finding the Mode graphically Consider the weights of the 100 babies born at Mbulumbuzi Health Centre.
- 36. Worked Example

  | Class Limits | Class Boundaries | Frequency (f) |
  |---|---|---|
  | 1.10-1.50 | 1.05-1.55 | 1 |
  | 1.60-2.00 | 1.55-2.05 | 10 |
  | 2.10-2.50 | 2.05-2.55 | 14 |
  | 2.60-3.00 | 2.55-3.05 | 21 |
  | 3.10-3.50 | 3.05-3.55 | 30 |
  | 3.60-4.00 | 3.55-4.05 | 13 |
  | 4.10-4.50 | 4.05-4.55 | 6 |
  | 4.60-5.00 | 4.55-5.05 | 3 |
  | 5.10-5.50 | 5.05-5.55 | 2 |

  Table: Frequency distribution with Class Boundaries. The class boundaries are plotted on the x-axis while the class frequencies are plotted on the y-axis.
- 37. Figure…… weights of the babies born at Mbulumbuzi Health Centre.
- 38. How to determine the mode. 1st step ; identify the modal class (3.05-3.55) 2nd step; identify the frequency of the class before and after the modal class on the chart (2.55- 3.05 and 3.55 – 4.05) These should be identified on the chart as shown in the subsequent figures
- 39. Figure 1.1 Figure 1.2
- 40. In Figure 1.1, the frequency for the class before the modal class is represented by the point A (corner), The frequency for the modal class is represented by the positions B and C and the frequency for the class after the modal class is represented by the point D. Note that if the frequency of the class before the modal class is higher than that of the class after the modal class, the position (value of the Mode) of the mode is closer to the lower class boundary of the modal class as is the case in Figure 1.2,
- 41. Finding the Mode analytically $$\text{Mode} = L + \frac{D_1}{D_1 + D_2} \times C$$ Where; L : is the lower class boundary of the modal class, $D_1$: is the difference between the frequency of the modal class and the frequency of the class before it, $D_2$: is the difference between the frequency of the modal class and the frequency of the class after it and C : is the class width of the modal class.
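A minimal sketch of the analytical mode for the Mbulumbuzi table follows. It assumes the usual convention that D1 and D2 are the differences between the modal frequency and the neighbouring frequencies, which agrees with the graphical note that a larger frequency before the modal class pulls the mode toward the lower boundary. The function name `grouped_mode` is hypothetical.

```python
def grouped_mode(lower_boundary, f_modal, f_before, f_after, width):
    """Mode = L + D1 / (D1 + D2) * C for a grouped frequency distribution."""
    d1 = f_modal - f_before   # modal frequency minus frequency of the class before
    d2 = f_modal - f_after    # modal frequency minus frequency of the class after
    return lower_boundary + d1 / (d1 + d2) * width


# Modal class 3.05-3.55 (f = 30); neighbours have f = 21 and f = 13.
mode_estimate = grouped_mode(3.05, 30, 21, 13, 0.5)
print(round(mode_estimate, 3))
```

Because the class before (21) is more frequent than the class after (13), the estimate falls in the lower half of the modal class.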
- 42. The Median Definition – the median is the value which separates the largest 50% of data values from the lowest 50% or the middle value after the data is ranked. Just like the mode, the median may be found using two main methods; i.e. a. Graphically b. Analytical (use of a formula)
- 43. Table ……

  | Class Limits | Class Boundaries | Frequency (f) | “Or less” Cumulative frequency | “Or more” Cumulative frequency |
  |---|---|---|---|---|
  | 1.10-1.50 | 1.05-1.55 | 1 | 0 | 100 |
  | 1.60-2.00 | 1.55-2.05 | 10 | 1 | 99 |
  | 2.10-2.50 | 2.05-2.55 | 14 | 11 | 89 |
  | 2.60-3.00 | 2.55-3.05 | 21 | 25 | 75 |
  | 3.10-3.50 | 3.05-3.55 | 30 | 46 | 54 |
  | 3.60-4.00 | 3.55-4.05 | 13 | 76 | 24 |
  | 4.10-4.50 | 4.05-4.55 | 6 | 89 | 11 |
  | 4.60-5.00 | 4.55-5.05 | 3 | 95 | 5 |
  | 5.10-5.50 | 5.05-5.55 | 2 | 98 | 2 |
  | | >5.55 | | 100 | 0 |
- 44. Finding the Median graphically We shall first look at a new frequency distribution called the cumulative frequency distribution. This is where the class frequencies are cumulated from 0 to the total frequency (∑f), or from ∑f to 0.
- 45. How to compute the cumulative frequencies The “or less” cumulative frequency 1st step: By asking questions about the lower class boundary as follows; How many people had a value of 1.05 or less? The answer is zero (0) 2nd step: By asking questions about the upper class boundary as follows; How many people had a value of 1.55 or less? The answer is one (1) which is the frequency for the class 1.05 – 1.55
- 46. 3rd step: Next is how many people had values of 2.05 or less? The answer is 11, which is the 10 in the class 1.55 – 2.05 plus the 1 in the class 1.05 – 1.55. You continue like that. The “or more” cumulative frequency distribution is found in a similar manner. The cumulative frequency distribution is used to plot a chart called the Ogive or the Cumulative Frequency Curve.
- 47. Figure ……
- 48. In this case there were 100 babies, so the value of the 50th baby can be read on the x-axis which is the Median.
- 49. Finding the Median analytically The Median is found by; $$\text{Median} = L + \left(\frac{\frac{N}{2} - Cf_b}{f}\right) \times C$$ L : is the lower class boundary of the median class, N : is the total frequency, f : is the frequency of the median class, $Cf_b$ : is the cumulative frequency of the class before the median class and C : is the class width of the median class.
- 50. The Median class is the class in which the median will be found. It is the class that contains the half-way member. It can be found by using the cumulative frequencies to identify where the half-way member is.
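The cumulative frequencies and the analytical median can be sketched together as below, using the class boundaries and frequencies of the Mbulumbuzi table. The function name `grouped_median` is an assumption made for this illustration.

```python
from itertools import accumulate

boundaries = [(1.05, 1.55), (1.55, 2.05), (2.05, 2.55), (2.55, 3.05), (3.05, 3.55),
              (3.55, 4.05), (4.05, 4.55), (4.55, 5.05), (5.05, 5.55)]
freqs = [1, 10, 14, 21, 30, 13, 6, 3, 2]

# "Or less" cumulative frequency at each upper class boundary.
cumulative = list(accumulate(freqs))


def grouped_median(boundaries, freqs):
    """Median = L + ((N/2 - Cf_b) / f) * C, using the median class."""
    total = sum(freqs)
    half = total / 2
    running = list(accumulate(freqs))
    for i, cf in enumerate(running):
        if cf >= half:                       # first class reaching the half-way member
            lower, upper = boundaries[i]
            cf_before = running[i - 1] if i > 0 else 0
            return lower + (half - cf_before) / freqs[i] * (upper - lower)


median_estimate = grouped_median(boundaries, freqs)
print(cumulative)
print(round(median_estimate, 3))
```

Here the median class is 3.05-3.55, since the cumulative frequency passes 50 within it (46 before, 76 after).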
- 51. The arithmetic mean For grouped data, the arithmetic mean has to take into consideration the frequencies as well as the class size. For each class, the value that represents the class is the class midpoint. This value will be the one which now will have the stated frequency
- 52. Table ……

  | Class Limits | Class Boundaries | Midpoint (x) | Frequency (f) | fx |
  |---|---|---|---|---|
  | 1.10-1.50 | 1.05-1.55 | 1.3 | 1 | 1.3 |
  | 1.60-2.00 | 1.55-2.05 | 1.8 | 10 | 18 |
  | 2.10-2.50 | 2.05-2.55 | 2.3 | 14 | 32.2 |
  | 2.60-3.00 | 2.55-3.05 | 2.8 | 21 | 58.8 |
  | 3.10-3.50 | 3.05-3.55 | 3.3 | 30 | 99 |
  | 3.60-4.00 | 3.55-4.05 | 3.8 | 13 | 49.4 |
  | 4.10-4.50 | 4.05-4.55 | 4.3 | 6 | 25.8 |
  | 4.60-5.00 | 4.55-5.05 | 4.8 | 3 | 14.4 |
  | 5.10-5.50 | 5.05-5.55 | 5.3 | 2 | 10.6 |
  | | | | Σf = 100 | Σfx = 309.5 |
- 53. Class midpoint (x) = (Upper class boundary + Lower class boundary)/2 The total (sum of values) is obtained by adding up the fx column. The mean for grouped data is obtained by dividing this total by the sum of frequencies. Arithmetic mean for grouped data is given by $$\bar{x} = \frac{\sum fx}{\sum f}$$
- 54. For the data above, $$\bar{x} = \frac{\sum fx}{\sum f} = \frac{309.5}{100} = 3.095$$
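The grouped mean can be reproduced with a few lines of Python; the sketch below simply rebuilds the fx column from the midpoints and frequencies in the table. Variable names are illustrative.

```python
midpoints = [1.3, 1.8, 2.3, 2.8, 3.3, 3.8, 4.3, 4.8, 5.3]
freqs = [1, 10, 14, 21, 30, 13, 6, 3, 2]

sum_f = sum(freqs)                                      # total frequency
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))   # total of the fx column
grouped_mean = sum_fx / sum_f

print(sum_f, round(sum_fx, 1), round(grouped_mean, 3))
```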
- 56. Dispersion The measure of the spread or variability No Variability – No Dispersion
- 57. Measures of Variation There are 2 values used to measure the amount of dispersion or variation. (The spread of the group) 1. Range 2. Standard Deviation
- 58. Why is it Important? You want to choose the best brand of medicine for your patients. You are interested in how long the drugs take to cure a disease. The choices are narrowed down to 2 different drugs. The results are shown in the chart. Which drug would
- 59. The chart indicates the number of days a drug takes to cure a particular disease.

  | | Drug A | Drug B |
  |---|---|---|
  | | 10 | 35 |
  | | 60 | 45 |
  | | 50 | 30 |
  | | 30 | 35 |
  | | 40 | 40 |
  | | 20 | 25 |
  | Total | 210 | 210 |
- 60. Does the Average Help? Drug A: Avg = 210/6 = 35 days Drug B: Avg = 210/6 = 35 days Both drugs take 35 days on average to cure the disease, so the average is no help in deciding which to choose.
- 61. Consider the Spread Drug A: Spread = 60 – 10 = 50 days Drug B: Spread = 45 – 25 = 20 days Drug B has a smaller variability which means that it performs more consistently. Choose drug B.
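The drug comparison can be expressed as a short sketch that computes both the average and the spread (range) for each drug. The helper name `summarize` is hypothetical.

```python
drug_a = [10, 60, 50, 30, 40, 20]
drug_b = [35, 45, 30, 35, 40, 25]


def summarize(days):
    """Return (average, range) for a list of cure times in days."""
    average = sum(days) / len(days)
    spread = max(days) - min(days)
    return average, spread


print("Drug A:", summarize(drug_a))
print("Drug B:", summarize(drug_b))
```

Equal averages with different ranges is exactly the situation where a measure of dispersion is needed to separate the two options.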
- 62. Range The range is the difference between the lowest value in the set and the highest value in the set. Range = High # - Low #
- 63. Example Find the range of the data set. 40, 30, 15, 2, 100, 37, 24, 99 Range = 100 – 2 = 98
- 64. Deviation from the Mean A deviation from the mean, x – x bar, is the difference between the value of x and the mean x bar. We base our formulas for variance and standard deviation on the amount that they deviate from the mean.
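- 65. Formulae for sample and population variances

  | | Definition / Computation formula | Machine formula |
  |---|---|---|
  | Sample variance | $s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$ | $s^2 = \dfrac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$ |
  | Population variance | $\sigma^2 = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$ | $\sigma^2 = \dfrac{\sum x_i^2}{N} - \left(\dfrac{\sum x_i}{N}\right)^2$ |
- 66. Standard Deviation The standard deviation is the square root of the variance. $$s = \sqrt{s^2}$$
- 67. Example – Using Formula Find the variance of the following dataset 6, 3, 8, 5, 3 (in hours)

  | $x$ | $x^2$ |
  |---|---|
  | 6 | 36 |
  | 3 | 9 |
  | 8 | 64 |
  | 5 | 25 |
  | 3 | 9 |
  | $\sum x = 25$ | $\sum x^2 = 143$ |
- 69. Find the standard deviation The standard deviation is the square root of the variance. $$s = \sqrt{4.5} \approx 2.12$$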
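The sample variance of 4.5 used above follows from the machine formula: (143 - 25²/5)/(5 - 1) = (143 - 125)/4 = 4.5. A small sketch that repeats this arithmetic:

```python
import math

hours = [6, 3, 8, 5, 3]
n = len(hours)

sum_x = sum(hours)                   # sum of the values
sum_x2 = sum(x * x for x in hours)   # sum of the squared values

# Machine formula for the sample variance.
variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)
std_dev = math.sqrt(variance)

print(sum_x, sum_x2, variance, round(std_dev, 2))
```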
- 70. Standard deviation for grouped data For grouped data, the standard deviation has to take into account the class frequencies, the class width as well as the value of the mean. The mean, $\bar{x}$, is calculated as stated earlier; $$\bar{x} = \frac{\sum fx}{\sum f}$$
- 71. Worked Example Compute the standard deviation of the grouped data presented in the table below.

  | Class Limits | Frequency (f) |
  |---|---|
  | 1.10-1.50 | 1 |
  | 1.60-2.00 | 10 |
  | 2.10-2.50 | 14 |
  | 2.60-3.00 | 21 |
  | 3.10-3.50 | 30 |
  | 3.60-4.00 | 13 |
  | 4.10-4.50 | 6 |
  | 4.60-5.00 | 3 |
  | 5.10-5.50 | 2 |
  | Total | 100 |

  Table: Frequency distribution.
- 72. Worked Example

  | Class Limits | Class Boundaries | Class Midpoint (x) | Frequency (f) |
  |---|---|---|---|
  | 1.10-1.50 | 1.05-1.55 | 1.3 | 1 |
  | 1.60-2.00 | 1.55-2.05 | 1.8 | 10 |
  | 2.10-2.50 | 2.05-2.55 | 2.3 | 14 |
  | 2.60-3.00 | 2.55-3.05 | 2.8 | 21 |
  | 3.10-3.50 | 3.05-3.55 | 3.3 | 30 |
  | 3.60-4.00 | 3.55-4.05 | 3.8 | 13 |
  | 4.10-4.50 | 4.05-4.55 | 4.3 | 6 |
  | 4.60-5.00 | 4.55-5.05 | 4.8 | 3 |
  | 5.10-5.50 | 5.05-5.55 | 5.3 | 2 |
  | | | | Σf = 100 |

  Table: Frequency distribution with Class Boundaries. For each class, the value that represents the class is the class midpoint.
- 73. Worked Example

  | Class Limits | Class Boundaries | Class Midpoint (x) | Frequency (f) | Deviation ($x - \bar{x}$) |
  |---|---|---|---|---|
  | 1.10-1.50 | 1.05-1.55 | 1.3 | 1 | -1.795 |
  | 1.60-2.00 | 1.55-2.05 | 1.8 | 10 | -1.295 |
  | 2.10-2.50 | 2.05-2.55 | 2.3 | 14 | -0.795 |
  | 2.60-3.00 | 2.55-3.05 | 2.8 | 21 | -0.295 |
  | 3.10-3.50 | 3.05-3.55 | 3.3 | 30 | 0.205 |
  | 3.60-4.00 | 3.55-4.05 | 3.8 | 13 | 0.705 |
  | 4.10-4.50 | 4.05-4.55 | 4.3 | 6 | 1.205 |
  | 4.60-5.00 | 4.55-5.05 | 4.8 | 3 | 1.705 |
  | 5.10-5.50 | 5.05-5.55 | 5.3 | 2 | 2.205 |
  | | | | Σf = 100 | |
- 74. From the table above, we need $\sum f(x - \bar{x})^2$. The variance can then be computed as $$\frac{\sum f(x - \bar{x})^2}{\sum f} = \frac{65.5475}{100} = 0.655475$$ and the standard deviation as $$\sqrt{\frac{\sum f(x - \bar{x})^2}{\sum f}} = \sqrt{0.655475} \approx 0.8096$$
- 75. The better formula for computation is; $$\text{Variance} = \frac{\sum fx^2}{\sum f} - \bar{x}^2$$ The standard deviation is the square root of this value.
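The grouped variance can be recomputed directly from the midpoints and frequencies in the worked example. The sketch below evaluates both the definition formula and the shortcut formula so that the two can be compared; it divides by Σf, as the slides do. Variable names are illustrative.

```python
import math

midpoints = [1.3, 1.8, 2.3, 2.8, 3.3, 3.8, 4.3, 4.8, 5.3]
freqs = [1, 10, 14, 21, 30, 13, 6, 3, 2]

n = sum(freqs)
mean = sum(f * x for f, x in zip(freqs, midpoints)) / n

# Definition formula: frequency-weighted squared deviations from the mean.
sum_sq_dev = sum(f * (x - mean) ** 2 for f, x in zip(freqs, midpoints))
variance_definition = sum_sq_dev / n

# Shortcut formula: mean of the squares minus the square of the mean.
variance_shortcut = sum(f * x * x for f, x in zip(freqs, midpoints)) / n - mean ** 2

std_dev = math.sqrt(variance_definition)
print(round(sum_sq_dev, 4), round(variance_definition, 6), round(std_dev, 4))
```

The two formulas agree up to floating-point rounding; the shortcut avoids computing each deviation separately, which is why it is easier to use by hand or on a calculator.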
- 76. Interquartile Range • The interquartile range tells you the spread of the middle half of your distribution. • Quartiles segment any distribution that’s ordered from low to high into four equal parts. • The interquartile range (IQR) contains the second and third quartiles, or the middle half of your data set.
- 78. Remember the range gives you the spread of the whole data set, the interquartile range gives you the range of the middle half of a data set
- 79. Calculation of IQR The interquartile range is found by subtracting the Q1 value from the Q3 value
- 80. Formula: IQR = Q3 - Q1 Explanation: IQR = interquartile range; Q3 = 3rd quartile or 75th percentile; Q1 = 1st quartile or 25th percentile
- 81. Q1 is the value below which 25 percent of the distribution lies, while Q3 is the value below which 75 percent of the distribution lies. You can think of Q1 as the median of the first half and Q3 as the median of the second half of the distribution.
- 82. Methods for finding the interquartile range Although there’s only one formula, there are various different methods for identifying the quartiles. You’ll get a different value for the interquartile range depending on the method you use. Here, we will discuss two of the most commonly used methods. These methods differ based on how they use the median.
- 83. Exclusive method vs inclusive method The exclusive method excludes the median when identifying Q1 and Q3, the inclusive method includes the median in identifying the quartiles. Remember! The procedure for finding the median is different depending on whether your data set is odd- or even-numbered.
- 84. When you have an odd number of data points, the median is the value in the middle of your data set. You can choose between the inclusive and exclusive method. With an even number of data points, there are two values in the middle, so the median is their mean. It’s more common to use the exclusive method in this case.
- 85. There is little consensus on the best method for finding the interquartile range. The exclusive interquartile range is always larger than the inclusive interquartile range.
- 86. The exclusive interquartile range may be more appropriate for large samples, while for small samples, the inclusive interquartile range may be more representative because it’s a narrower range
- 87. Steps for the exclusive method Even-numbered data set (n=10) Step 1: Order your values from low to high.
- 88. Step 2: Locate the median, and then separate the values below it from the values above it .
- 89. Step 3: Find Q1 and Q3. Q1 is the median of the first half and Q3 is the median of the second half. Since each of these halves have an odd number of values, there is only one value in the middle of each half.
- 91. Step 4: Calculate the interquartile range.
- 92. Odd-numbered data set (n=11) Step 1: Order your values from low to high.
- 93. Step 2: Locate the median, and then separate the values below it from the values above it.
- 94. Step 3: Find Q1 and Q3.
- 95. Step 4: Calculate the interquartile range.
- 96. Steps for the inclusive method Almost all of the steps for the inclusive and exclusive method are identical. The difference is in how the data set is separated into two halves. The inclusive method is sometimes preferred for odd-numbered data sets because it doesn’t ignore the
- 97. n=11 Step 1: Order your values from low to high
- 98. Step 2: Find the median.
- 99. Step 3: Separate the list into two halves, and include the median in both halves.
- 100. Step 4: Find Q1 and Q3.
- 101. Step 5: Calculate the interquartile range.
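The exclusive and inclusive procedures can be sketched in Python as below. The data set is hypothetical (the slide examples were images and are not reproduced here), and the function name `quartiles_iqr` is an assumption. For an even number of values the median is not itself a data point, so this sketch splits the list the same way under both methods.

```python
import statistics


def quartiles_iqr(values, method="exclusive"):
    """Return (Q1, Q3, IQR) using the exclusive or inclusive median-split method."""
    data = sorted(values)                # Step 1: order from low to high
    n = len(data)
    half = n // 2
    if n % 2 == 0:
        lower, upper = data[:half], data[half:]
    elif method == "inclusive":
        # Odd n: the median value is included in both halves.
        lower, upper = data[:half + 1], data[half:]
    else:
        # Odd n: the median value is left out of both halves.
        lower, upper = data[:half], data[half + 1:]
    q1 = statistics.median(lower)        # median of the lower half
    q3 = statistics.median(upper)        # median of the upper half
    return q1, q3, q3 - q1


# Hypothetical odd-numbered data set (n = 11), not taken from the slides.
sample = [2, 4, 5, 7, 8, 10, 11, 13, 14, 17, 20]
exclusive = quartiles_iqr(sample, "exclusive")
inclusive = quartiles_iqr(sample, "inclusive")
print("exclusive:", exclusive)
print("inclusive:", inclusive)
```

For this sample the inclusive range comes out narrower than the exclusive one, in line with the comparison made earlier in the slides.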
- 102. When is the interquartile range useful? The interquartile range is an especially useful measure of variability for skewed distributions. For these distributions, the median is the best measure of central tendency because it’s the value exactly in the middle when all values are ordered from low to high. The IQR is also useful for datasets with outliers. Because it’s based on the middle half of the distribution, it’s less influenced by extreme values.
- 103. Visualize the interquartile range in boxplots A boxplot, or a box-and-whisker plot, summarizes a data set visually using a five-number summary.
- 104. Every distribution can be organized using these five numbers: Lowest value Q1: 25th percentile Median Q3: 75th percentile Highest value (Q4)
- 106. The vertical lines in the box show Q1, the median, and Q3, while the whiskers at the ends show the highest and lowest values.
- 107. In a boxplot, the width of the box shows you the interquartile range. A smaller width means you have less dispersion, while a larger width means you have more dispersion
- 108. An inclusive interquartile range will have a smaller width than an exclusive interquartile range. Boxplots are especially useful for showing the central tendency and dispersion of skewed distributions.
- 109. The placement of the box tells you the direction of the skew. A box that’s much closer to the right side means you have a negatively skewed distribution. A box closer to the left side tells you that you have a positively skewed distribution.
- 112. Introduction to Probability A Probability Experiment is a process which leads to well-defined results called outcomes. For example, the toss of a coin is a probability experiment because it leads to results called outcomes such as “Heads” and “Tails”. There are many such probability experiments, such as the toss of two coins, the roll of a die, etc. The set of all possible outcomes from a probability experiment is called a Sample Space
- 113. For example If a coin is tossed, the sample space is {H,T} If flipping two coins, the sample space is {HH, HT, TH, TT} Event is one or more outcomes of a probability experiment Getting a “Head” in a toss of a coin is an event. Getting “Heads” on both tosses of two coins is an event
- 114. Probability is defined as the likelihood of an event happening. The probability of an event E, denoted P(E) is a definition of how likely that event is to happen. This definition is usually numerical. The value of the probability of any event is always between zero and one inclusive
- 115. Two main approaches to probability 1. The Classical Approach 2. Empirical Approach Classical approach/definition to Probability The Classical definition of the probability of the event E is defined as the number of ways or times the event E occurs divided by the number of all possible outcomes including the event E.
- 116. Mathematically, this can be expressed as follows: $$P(E) = \frac{\text{The number of times or ways in which the event } E \text{ occurs}}{\text{The total number of all possible outcomes including the event } E}$$
- 117. Example If a doctor sees 10 patients with malaria, 5 patients with diarrhoea, 15 patients with respiratory problems and 20 patients with skin diseases, he will have seen 50 patients on the day. If he needs to interview, at random, one of the patients seen on the day to give him an indication of how his service was, what is the probability that the patient to be interviewed will have skin diseases?
- 118. $$P(\text{Skin Disease}) = \frac{\text{The number of patients with skin disease}}{\text{The total number of all patients}} = \frac{20}{50} = 0.4$$
- 119. Empirical approach Empirical probability is based on past observations. The empirical probability of an event is the relative frequency of a frequency distribution based upon past observations. The definition of the empirical probability of any event E is the number of times the event E occurred in the past divided by the total number of times the experiment was carried out
- 120. Mathematically, $$P(E) = \frac{\text{The number of times or ways in which the event } E \text{ occurred}}{\text{The total number of times the experiment was carried out}}$$
- 121. Limiting values of probability When the probability of an event is zero (0), the event is said to be an absolute impossibility i.e. there is absolutely no way the event can happen. When the probability of an event is one (1), the event is said to be an absolute certainty. $$0 \le P(E) \le 1$$
- 122. Class to suggest events in life whose probability is zero Class to suggest events in life whose probability is one.
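- 123. Counting Rules 1. Factorials Definition: Factorial, e.g. 4! = 4 x 3 x 2 x 1 and 7! = 7 x 6 x 5 x 4 x 3 x 2 x 1 2. PERMUTATION RULES Definition: $$^nP_r = \frac{n!}{(n-r)!}$$ Example $$^5P_3 = \frac{5!}{(5-3)!} = \frac{5 \times 4 \times 3 \times 2 \times 1}{2 \times 1} = 5 \times 4 \times 3 = 60$$ $$^8P_5 = \frac{8!}{(8-5)!} = \frac{8 \times 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1}{3 \times 2 \times 1} = 8 \times 7 \times 6 \times 5 \times 4 = 6720$$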
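The permutation rule can be checked with Python's standard `math` module. The helper `n_p_r` is a name chosen here for illustration; `math.perm` (available from Python 3.8) computes the same quantity directly.

```python
import math


def n_p_r(n, r):
    """Number of permutations of r items chosen from n: n! / (n - r)!"""
    return math.factorial(n) // math.factorial(n - r)


print(math.factorial(4), math.factorial(7))
print(n_p_r(5, 3), n_p_r(8, 5))
```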
- 125. Probability Laws Consider a bag containing coloured marbles; 10 black, 5 red, 5 blue and 3 yellow, then the probability of picking a green marble from this bag is 0 because there are no green marbles in the bag. What is the probability of picking? A black marble? A yellow marble? A marble that is not black?
- 126. Let's compute the probabilities $$P(\text{black}) = \frac{\text{Number of times or ways a black marble appears}}{\text{Total number of marbles}} = \frac{10}{23}$$ $$P(\text{Not black}) = \frac{\text{Number of times or ways a non-black marble appears}}{\text{Total number of marbles}} = \frac{13}{23}$$
- 127. The above results indicate that P(black)=10/23 and P(not black)=13/23 are complementary. They add up to 1 i.e. P(Black) + P(not Black) = 1. This shows that the sum of all probabilities in the sample space is 1 and also giving the basic rule of probability which says that the probability of an event occurring plus the probability of the event not occurring is equal to 1.
- 128. P(E) + P(not E) = 1 The Addition law of probabilities
- 129. A pack of cards has 52 cards (excluding the Jokers). The cards are in two basic colours, black and red. Of the 52 cards, half (26 cards) are red while the other half are black. The picture above shows the 52 cards. Flowers (13) and Spades (13) are black as shown above while Hearts (13) and Diamonds (13) are red. Each suit has an Ace (the cards on the far left). The probability of picking a Heart is $$P(\text{heart}) = \frac{13}{52} = P(\text{diamond}) = P(\text{spade}) = P(\text{flower})$$
- 130. $$P(\text{Red Card}) = \frac{\text{Number of red cards}}{\text{Total number of cards in a pack}} = \frac{26}{52}$$ Note that the event "Red Card" is a compound event i.e. it contains other events. The event "Red Card" is actually the event "Hearts" or "Diamonds" i.e. $$P(\text{Red Card}) = P(\text{Hearts or Diamonds}) = \frac{13}{52} + \frac{13}{52} = \frac{26}{52}$$
- 131. This observation is actually true for any two events which are mutually exclusive. Events are said to be mutually exclusive if they both cannot happen at the same time. If two events are mutually exclusive, then the probability of either event occurring is the sum of the probabilities of each occurring. This is called the Addition Law of probabilities for mutually exclusive events.
- 132. In general therefore, if two events A and B are mutually exclusive, then the probability of event A or B happening is the sum of the individual probabilities i.e. $$P(A \text{ or } B) = P(A) + P(B)$$ Example 2. What is the probability of picking a Spade or a Heart from a pack of cards? Example 3. What is the probability of picking a Spade or an Ace from a pack of cards?
- 133. Note that the two events “Spade” and “Ace” are not mutually exclusive because they both can happen at the same time, i.e. there is a card that is both an Ace and a Spade. The card is the Ace of Spades. In this case $$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$$
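Examples 2 and 3 can be worked through by counting cards in a modelled deck. The sketch below is one possible way to do this; the deck representation and the helper `prob` are assumptions made for illustration, and "Flowers" is used for the fourth suit as in the slides.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["Spades", "Hearts", "Diamonds", "Flowers"]
deck = list(product(ranks, suits))   # 52 (rank, suit) pairs


def prob(event):
    """Classical probability: favourable cards divided by all cards."""
    favourable = sum(1 for card in deck if event(card))
    return Fraction(favourable, len(deck))


p_spade = prob(lambda c: c[1] == "Spades")
p_heart = prob(lambda c: c[1] == "Hearts")
p_ace = prob(lambda c: c[0] == "A")
p_ace_of_spades = prob(lambda c: c[0] == "A" and c[1] == "Spades")

# Example 2: mutually exclusive events, so the probabilities simply add.
p_spade_or_heart = prob(lambda c: c[1] in ("Spades", "Hearts"))

# Example 3: not mutually exclusive, so the overlap is subtracted once.
p_spade_or_ace = prob(lambda c: c[1] == "Spades" or c[0] == "A")

print(p_spade_or_heart, p_spade_or_ace)
```

Counting the cards directly gives the same answers as the two addition laws, which is a useful check that the Ace of Spades is not counted twice.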
- 134. The Multiplication law of probabilities Consider the toss of a coin. The probability of getting a “Head” when a coin is tossed is 0.5. Suppose one wants to have two tosses. Is there a difference in outcomes if one person tosses twice compared to two people tossing once? Why? The discussion will have shown that tossing a coin twice by the same person is the same as two people tossing a coin once each. The reason is that, as far as outcomes are concerned, the result of the first toss is independent of the result of the second toss when one person tosses a coin twice.
- 135. In general, events are said to be independent if the occurrence of one event does not affect the occurrence of the other in any way or two events are independent if the occurrence of one does not change the probability of the other occurring. Consider the toss of two coins; what is the probability of getting “Heads” on both tosses?
- 136.

  | COIN A | COIN B |
  |---|---|
  | H | H |
  | H | T |
  | T | H |
  | T | T |

  There are 4 possible outcomes when two coins are tossed (HH, HT, TH and TT). Out of the four possible outcomes, only one has Heads (H) on both Coin A and Coin B.
- 137. The probability of getting "Heads" on both coins when two coins are tossed is P(Head and Head) = $\frac{1}{4}$, but $\frac{1}{4} = \frac{1}{2} \times \frac{1}{2}$, so P(Head and Head) = $\frac{1}{2} \times \frac{1}{2}$. In general, if events A and B are independent, the probability of event A and event B happening is given by; P(A and B) = P(A) x P(B)
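The two-coin result can be confirmed by listing the sample space and comparing the counted probability with the product of the individual probabilities. Variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space for tossing two coins: HH, HT, TH, TT.
outcomes = list(product("HT", repeat=2))

# Counted probability of Heads on both coins.
p_both_heads = Fraction(sum(1 for o in outcomes if o == ("H", "H")), len(outcomes))

# Multiplication law for independent events.
p_head = Fraction(1, 2)
p_product = p_head * p_head

print(outcomes)
print(p_both_heads, p_product)
```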
- 138. There are 20 marbles in the bag of which 6 are red, 2 are blue and the rest are white. What is the probability of picking a white marble from the bag? $$P(E) = \frac{\text{The number of times white marbles occur in the bag}}{\text{The total number of all possible outcomes including the white marbles}} = \frac{12}{20}$$
- 139. If two marbles were to be picked from the bag with replacement, what is the probability that both marbles would be white? The answer to this question follows from the multiplication law of probabilities i.e. P(A and B) = P(A) x P(B). However, if the marbles are picked without replacement the situation would be different.
- 140. The probability of picking a white marble the first time would remain $\frac{12}{20}$ because the number of white marbles is 12 and the total number of marbles in the bag is 20. When the first marble is picked and then not put back in the bag, the total number of marbles in the bag reduces to 19. The probability of picking a white marble the second time depends on the colour of the marble that has been taken out of the bag. If the marble taken out of the bag from the first pick is white, then the probability of a white marble the second time around is $\frac{11}{19}$.
- 141. If the marble taken out of the bag from the first pick is not white, then the probability of a white marble the second time around is $\frac{12}{19}$. This therefore means that there are two possible values for the probability of a white marble on the second pick. For two white marbles without replacement the first pick must be white, so P(White and White) = $\frac{12}{20} \times \frac{11}{19}$. If the marble in the first pick was not white, the corresponding sequence has probability P(Not white and White) = $\frac{8}{20} \times \frac{12}{19}$.
- 142. In other words, the probability of picking a white marble the second time is dependent on the result of the first pick. We say that the probability of picking a white marble the second time is conditional on the result of the first experiment. In general, if two events A and B are not independent, i.e. the occurrence of one event does affect the probability of the other occurring (the events are dependent) the probability of both events happening is given by; P(A and B)=P(A) x P(B|A)
- 143. The probability of event B occurring given that event A has already occurred is read "the probability of B given A" and is written: P(B|A) this is the conditional probability of B given that the event A has already occurred and we have the result of that experiment.
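The two multiplication laws can be checked with exact fractions. This is a minimal sketch of the marble example; the variable names are illustrative and not part of the original slides.

```python
from fractions import Fraction

WHITE, TOTAL = 12, 20  # 20 marbles: 6 red, 2 blue, 12 white

# With replacement the picks are independent: P(A and B) = P(A) x P(B)
p_with_replacement = Fraction(WHITE, TOTAL) * Fraction(WHITE, TOTAL)

# Without replacement the second pick is conditional on the first:
# P(A and B) = P(A) x P(B|A), where P(B|A) = 11/19
p_without_replacement = Fraction(WHITE, TOTAL) * Fraction(WHITE - 1, TOTAL - 1)

print(p_with_replacement, float(p_with_replacement))
print(p_without_replacement, float(p_without_replacement))
```

Using `Fraction` keeps the arithmetic exact, so the results can be compared directly with the hand calculation.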
- 144. Probability tree diagrams Calculating probabilities can sometimes be confusing. It may not be easy to tell when to use the addition law, the multiplication law or a combination of these. The probability tree diagram is a tool that can be used to simplify otherwise complex looking probability problems. A tree diagram is simply a way of representing a sequence of events which are a set of combinations of all possible outcomes from a situation. A Tree diagram helps us to see all possible outcomes of an event at a glance and simplifies
- 145. Example A hospital procurement department advertised for three contracts for the supply of gloves worth hundreds of thousands (Contract A), laboratory equipment worth millions (Contract B) and a dialysis machine worth tens of millions (Contract C). A supply company bids for the three contracts. The probability of getting contract A is 0.85. The probability of getting contract B depends on whether they get contract A or not. The probability of getting contract B if they get A is 0.9 but only 0.2 if they fail to get contract A. The probability of getting contract C depends on whether they get contract B.
- 146. It is 0.95 if they get B but only 0.1 if they fail to get contract B after getting contract A. If they fail to get A and get B, the probability of getting contract C is 0.6. If they fail to get A and fail to get B, they are not allowed to bid for contract C.
- 147. Draw a tree diagram to illustrate the probabilities of the outcomes. What is the probability of: Getting all three contracts; Getting two contracts only; Getting only one contract; Getting no more than two contracts; Getting at least two contracts;
- 148. Getting at most two contracts; Getting no contract at all; Getting contract B but not contract C; Getting contract C but not contract; Getting contract A but not the other contracts
- 149. Probability Distributions The right way is to start by introducing probability density functions and that will lead to aspects of calculus such as integrals which would put off many. I will introduce probability distributions as the distribution or break up of the total probability of 1 into several possible events or outcomes. As an example, the tree diagram has several branches which are events or outcomes. The total of probabilities from all branches is 1 but is distributed into several events
- 151. The total probability= 0.72675+0.03825+0.0085+0.0765+0.018+0.012+ 0.12 = 1 The total probability of 1 is distributed into seven different events or outcomes. The seven events are; (i) get A, get B and get C (ii) get A, get B and fail to get C (iii) get A, fail to get B, get C (iv) get A, fail to get B, fail to get C (v) fail to get A, get B and get C (vi) fail to get A, get B and fail to get C (vii) fail to get A and fail to get B
- 152. The probabilities of each of the seven events are presented on the ends of each branch of the tree diagram above.
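The branch probabilities of the contract example can be reproduced by multiplying along each path of the tree. The sketch below is only an illustration: the dictionary layout and the helper `prob` are assumptions made for this example, while the probabilities are the ones given in the problem statement.

```python
# Conditional probabilities taken from the procurement example.
P_A = 0.85
P_B_GIVEN_A = {True: 0.9, False: 0.2}  # P(get B), keyed by "got A?"
# P(get C), keyed by (got A?, got B?). The (False, False) branch is absent
# because the company may not bid for C after failing to get both A and B.
P_C_GIVEN_AB = {(True, True): 0.95, (True, False): 0.1, (False, True): 0.6}

outcomes = {}  # (got A, got B, got C) -> probability of that branch
for got_a in (True, False):
    pa = P_A if got_a else 1 - P_A
    for got_b in (True, False):
        pb = P_B_GIVEN_A[got_a] if got_b else 1 - P_B_GIVEN_A[got_a]
        if (got_a, got_b) not in P_C_GIVEN_AB:
            outcomes[(got_a, got_b, False)] = pa * pb  # branch stops at B
            continue
        pc = P_C_GIVEN_AB[(got_a, got_b)]
        outcomes[(got_a, got_b, True)] = pa * pb * pc
        outcomes[(got_a, got_b, False)] = pa * pb * (1 - pc)


def prob(condition):
    """Add up the branches whose number of contracts won meets the condition."""
    return sum(p for won, p in outcomes.items() if condition(sum(won)))


total = sum(outcomes.values())
p_all_three = prob(lambda k: k == 3)
p_exactly_two = prob(lambda k: k == 2)
p_at_most_two = prob(lambda k: k <= 2)
print(round(total, 5), round(p_all_three, 5), round(p_exactly_two, 5), round(p_at_most_two, 5))
```

The same `prob` helper can be reused for the other questions by changing the condition on the number of contracts won.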
- 153. A listing of all the values a random variable can assume, with their corresponding probabilities, makes a probability distribution. For example, the toss of a coin:

  | Expected Outcome (X) | Head | Tail | Total |
  |---|---|---|---|
  | Probability P(X) | 1/2 | 1/2 | 1 |

  The total probability is 1.
- 154. In many other situations, the total probability will have to be distributed into several events or outcomes (leading to fractions which will eventually have to add up to 1). This is basically the whole concept of probability distributions.
- 155. A random variable does not mean that the values can be anything (a random number). Random variables have a well defined set of outcomes and well defined probabilities for the occurrence of each outcome. For example, if you toss a coin, the known outcomes are Heads and Tails and the probability of each is 0.5. However, when a coin is about to be tossed, the outcome is not known; it can be either one, hence the term random.
- 156. Similarly, when a die is rolled, the known outcomes are 1, 2, 3, 4, 5 and 6; the probabilities of each event are also known to be 1/6 but when the die is being rolled, any outcome can appear. The random refers to the fact that the outcomes happen by chance -- that is, you don't know which outcome will occur next.
- 157. Here's an example of a probability distribution that results from the rolling of a single fair die.

  | X | 1 | 2 | 3 | 4 | 5 | 6 | Sum |
  |---|---|---|---|---|---|---|---|
  | P(x) | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 6/6 = 1 |
- 158. The Binomial Probability Distribution The binomial distribution is one of the discrete probability distributions. It is discrete because the outcomes of binomial experiments are whole numbers rather than fractions: binomial experiments find probabilities of whole-number counts of items, not fractional ones.
- 159. Binomial Experiment A binomial experiment has the following; 1. A fixed number of trials 2. Each trial is independent of the others 3. There are only two outcomes 4. The probability of each outcome remains constant from trial to trial. These can be summarized as: An experiment with a fixed number of independent trials, each of which can only have two possible outcomes.
- 160. Examples of Binomial Experiments Tossing a coin 6 times to see how many tails occur. There is a fixed number of tosses, i.e. 6. Each toss has two possible outcomes. Each toss is independent of the other and results of each toss do not affect the results of the other tosses. The probability of getting a Head or Tail is the same throughout the 6 tosses. Asking 20 people if they watch Television Malawi (TVM). You ask a fixed number of people i.e. 20. There are two possible outcomes, either they watch or they
- 161. Rolling a die 5 times to see if a 5 appears. The outcomes from tossing of coins can be arranged in a triangular pattern deliberately. The pattern gives us a clue as to how we can have outcomes for 5 coins and more!. Observe the coefficients of the outcomes. We shall isolate them and present them as follows;
- 162. Tossing of 4 coins.

  | No. of coins | Coefficients of the outcomes |
  |---|---|
  | 1 | 1 1 |
  | 2 | 1 2 1 |
  | 3 | 1 3 3 1 |
  | 4 | 1 4 6 4 1 |
- 163. You will observe that each coefficient is the sum of two coefficients above it! Such that for 5 coins, we can come up with the coefficients as follows; 1 5 10 10 5 1 For 6 coins, the coefficients will be; 1 6 15 20 15 6 1 etc.
- 164. The outcomes start with all successes on the left, reduce by one every step and end with all failures on the right. The other observation is that the number of coins is the second coefficient. The other thing to note is that the coefficients are symmetrical, whatever is on the left is the same on the right. This triangle is called Pascal’s triangle in honour of Blaise Pascal, a French mathematician who discovered it.
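The rule that each coefficient is the sum of the two coefficients above it can be turned into a few lines of code. A minimal sketch, with the function name `pascal_row` chosen here for illustration:

```python
def pascal_row(n):
    """Coefficients for n coins, built by summing adjacent entries of the previous row."""
    row = [1]
    for _ in range(n):
        row = [1] + [a + b for a, b in zip(row, row[1:])] + [1]
    return row


for coins in range(1, 7):
    print(coins, pascal_row(coins))
```

The rows for 5 and 6 coins agree with the coefficients listed above, and each row is symmetrical with the number of coins as its second entry.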
- 165. If the probability of success is denoted p and the probability of failure q then the outcomes may be presented in terms of probabilities as follows:

  | No. of coins | Probabilities |
  |---|---|
  | 1 | $p$, $q$ |
  | 2 | $p^2$, $2pq$, $q^2$ |
  | 3 | $p^3$, $3p^2q$, $3pq^2$, $q^3$ |
  | 4 | $p^4$, $4p^3q$, $6p^2q^2$, $4pq^3$, $q^4$ |
- 166. For each experiment (coin), the total probability is always equal to 1, i.e. $p + q = 1$; $p^2 + 2pq + q^2 = 1$; $p^3 + 3p^2q + 3pq^2 + q^3 = 1$; $p^4 + 4p^3q + 6p^2q^2 + 4pq^3 + q^4 = 1$, etc.
- 167. From your mathematics in secondary school, you will recall the expansion of binomials such as $(x+y)^2$, $(a+b)^3$ etc. If you expand $(a+b)$, $(a+b)^2$, $(a+b)^3$, $(a+b)^4$, ..., the coefficients of the terms are exactly the same as the ones in Pascal’s triangle. We can use this property for probabilities: for any n binomial trials or experiments in which the probability of success is p and the probability of failure is q, the probability distribution of the n experiments is given by:
- 168. $(p+q)^n = p^n + np^{n-1}q + \frac{n(n-1)}{2!}p^{n-2}q^2 + \frac{n(n-1)(n-2)}{3!}p^{n-3}q^3 + \dots$ Example: What is the probability of rolling exactly two sixes in 6 rolls of a die? There are five basic things you need to do to work a binomial problem like this one.
- 169. 1. Firstly define Success. Success in this case must be for a single trial. Success = "Rolling a 6 on a single die" 2. Define the probability of success p: p = 1/6 3. Find the probability of failure which is 1 - p: q = 5/6 4. Define the number of trials: n = 6 5. Define the number of successes out of those trials: x = 2
- 170. $(p+q)^6 = p^6 + 6p^{6-1}q + \frac{6(6-1)}{2!}p^{6-2}q^2 + \frac{6(6-1)(6-2)}{3!}p^{6-3}q^3 + \dots$ We need the term containing $p^2$, which is the probability of two successes. The term is $\frac{6(6-1)(6-2)(6-3)}{4!}p^{6-4}q^4 = 15p^2q^4$.
- 172. Apart from using knowledge of Pascal’s Triangle, we can use the knowledge of counting rules. Example: What is the probability of rolling exactly two sixes in 6 rolls of a die?
- 173. 1.Firstly define Success. Success in this case must be for a single trial. Success = "Rolling a 6 on a single die" 2. Define the probability of success p: p = 1/6 3. Find the probability of failure which is 1 - p: q = 5/6 4. Define the number of trials: n = 6 5. Define the number of successes out of those trials: x = 2
- 175. Example: A coin is tossed 10 times. What is the probability that exactly 6 heads will occur.
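Both examples can be worked with the counting-rule form of the binomial probability, P(X = x) = C(n, x) p^x q^(n-x). The formula slide did not survive in this transcript, so the sketch below assumes that standard form; the function name `binomial_pmf` is illustrative.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(exactly x successes in n independent trials with success probability p)."""
    q = 1 - p
    return comb(n, x) * p**x * q ** (n - x)


two_sixes = binomial_pmf(2, 6, 1 / 6)   # two sixes in 6 rolls of a die
six_heads = binomial_pmf(6, 10, 0.5)    # six heads in 10 tosses of a coin
print(round(two_sixes, 4), round(six_heads, 4))
```

Here `comb(6, 2)` equals 15, the same coefficient that Pascal's triangle gives for the $p^2q^4$ term.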
- 176. Mean, Variance and Standard Deviation
- 177. Example: Find the mean, variance, and standard deviation for the number of sixes that appear when rolling 30 dice.
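The formulas for this example are not preserved in the transcript. The sketch below assumes the standard binomial results, mean = np, variance = npq and standard deviation = sqrt(npq), applied to n = 30 dice with p = 1/6.

```python
from fractions import Fraction
from math import sqrt

n = 30                 # number of dice rolled
p = Fraction(1, 6)     # probability of a six on one die
q = 1 - p              # probability of not rolling a six

mean = n * p                    # np
variance = n * p * q            # npq
std_dev = sqrt(variance)        # square root of npq

print(mean, variance, round(std_dev, 4))
```

On average 5 sixes are expected, with a variance of 25/6 (about 4.17) and a standard deviation of about 2.04.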
- 178. Normal Distribution • Bell shaped. • Also called the “Gaussian curve” after the mathematician Carl Friedrich Gauss.
- 179. Properties of a Normal Distribution • Normal distributions are symmetric around their mean. • The mean, median, and mode of a normal distribution are equal and located at the peak. • The area under the normal curve is equal to 1.0. • Normal distributions are denser in the center and less dense in the tails.
- 180. The normal probability distribution is asymptotic: the curve gets closer and closer to the x-axis but never actually touches it. Normal distributions are defined by two parameters, the mean (μ) and the standard deviation (σ).
- 181. Approximately 68% of the area of a normal distribution is within one standard deviation of the mean. Approximately 95% of the area of a normal distribution is within two standard deviations of the mean.
- 182. The density of the normal distribution (the height for a given value on the x-axis) is $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$. The parameters μ and σ are the mean and standard deviation, respectively, and define the normal distribution. The symbol e is the base of the natural logarithm and π is the constant pi.
- 183. Empirical Rule • Approximately 68% of the data lies in the interval from μ - σ to μ + σ. Figure 1. Empirical Rule
- 184. Empirical Rule Example 1: Figure 2 shows a normal distribution of age of patients with a mean of 50 yrs and a standard deviation of 10 yrs. The shaded area is between 40 yrs and 60 yrs. What proportion of the distribution does the shaded area contain? Figure 2. Normal distribution of age of patients
- 185. Empirical Rule Example 2: A normal distribution of concentration of glycogen in the blood has a mean of 75mg and a standard deviation of 10. The shaded area on the normal distribution graph extends from 55.4mg to 94.6mg.
- 186. a. How many standard deviations are within the shaded area? b. Using Empirical rule, approximate the proportion of the shaded area under the curve.
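The two empirical-rule examples can be checked against exact normal areas. This is a sketch using `statistics.NormalDist` from the Python standard library; the helper name `shaded_area` is an assumption made for illustration.

```python
from statistics import NormalDist


def shaded_area(mean, sd, lower, upper):
    """Return the z-scores of the two limits and the area between them."""
    dist = NormalDist(mu=mean, sigma=sd)
    z_lower = (lower - mean) / sd
    z_upper = (upper - mean) / sd
    return z_lower, z_upper, dist.cdf(upper) - dist.cdf(lower)


# Example 1: ages with mean 50 and standard deviation 10, shaded from 40 to 60
print(shaded_area(50, 10, 40, 60))
# Example 2: mean 75 and standard deviation 10, shaded from 55.4 to 94.6
print(shaded_area(75, 10, 55.4, 94.6))
```

The first interval is one standard deviation either side of the mean (about 68%), and the second is 1.96 standard deviations either side, which the empirical rule approximates as two standard deviations (about 95%).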
- 187. Standard Normal Distribution The standard score and the standardized variable: for a population, the standard score (also called the normal deviate, z score or z value) is defined as $z_i = \frac{x_i - \mu}{\sigma}$, and for a sample it is $z_i = \frac{x_i - \bar{x}}{s}$.
- 188. The standard score (z) shows how far any given data value $x_i$ is from the mean of the distribution in standard deviation units, i.e. how many standard deviations the value is from the mean.
- 189. When, for any variable X, each measurement value in a sample or population is transformed into a z value, this process is known as standardizing (or normalizing) the variable, and the resulting variable Z is called a standardized variable.
- 190. Example 3: Assuming the following sample follows a normal distribution, first calculate $\bar{x}$ and s, and then standardize the sample to have a standard normal distribution: 3, 5, 7, 9, and 11.
- 191. Solution: $\bar{x} = \frac{\sum x_i}{n} = \frac{35}{5} = 7$ and $s = \sqrt{\frac{n\sum x_i^2 - (\sum x_i)^2}{n(n-1)}} = \sqrt{\frac{5(285) - (35)^2}{5(4)}} = 3.16228$.
- 192. Having determined s and $\bar{x}$, we can proceed and compute the z score for each observation: $z_1 = \frac{x_1 - \bar{x}}{s} = \frac{3 - 7}{3.16228} = -1.2649$; $z_2 = \frac{x_2 - \bar{x}}{s} = \frac{5 - 7}{3.16228} = -0.6325$; $z_3 = \frac{x_3 - \bar{x}}{s} = \frac{7 - 7}{3.16228} = 0$; $z_4 = \frac{x_4 - \bar{x}}{s} = \frac{9 - 7}{3.16228} = 0.6325$; $z_5 = \frac{x_5 - \bar{x}}{s} = \frac{11 - 7}{3.16228} = 1.2649$.
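The standardization in Example 3 can be reproduced with the standard library. A minimal sketch; `statistics.stdev` uses the sample (n - 1) formula, which matches the s computed in the solution.

```python
from statistics import mean, stdev

data = [3, 5, 7, 9, 11]

x_bar = mean(data)   # sample mean
s = stdev(data)      # sample standard deviation (n - 1 in the denominator)

# z score of each observation: (x_i - x_bar) / s
z_scores = [(x - x_bar) / s for x in data]

print(x_bar, round(s, 5))
print([round(z, 4) for z in z_scores])
```

The z scores are symmetric about 0 because the data are symmetric about the mean.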
- 193. Finding Areas under the Standard Normal Distribution curve Standard Normal Cumulative Probability Table provides the cumulative distribution function for values of z rounded to the nearest hundredth.
- 194. This table provides the area under the standard normal curve for values of z less than those identified in the table. This is illustrated in the figure on the right with the shaded region, labelled probability.
- 195. Figure: Area under the curve
- 196. The table below demonstrates how to use the table to find the area under the standard normal curve that lies to the left of a Z value. Let's suppose Z = 1.46. Notice that the value 1.46 = 1.4 + .06. The value 1.4 is found by scrolling down the first column of the table and the value .06 is found by moving right across the top row.
- 197. The intersection within the table of the row of 1.4 and the column of .06 is the value .9279. This is the area under the normal curve to the left of Z = 1.46.
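A table lookup can be cross-checked in software. This sketch uses `statistics.NormalDist` from the Python standard library, whose `cdf` method returns the area to the left of a value; it is offered as a check, not as a replacement for learning to read the table.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Area under the standard normal curve to the left of Z = 1.46
area_left = standard_normal.cdf(1.46)
print(round(area_left, 4))
```

The result agrees with the table entry at row 1.4 and column .06.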
- 198. Table 1. Standard Normal Cumulative Probability Table
- 199. Often times, we are interested in finding the Z-score that corresponds to a given area under the standard normal curve. The process involves searching the array of area values and working backwards to find the Z-score
- 200. Example 4: Using the tables, find the Z-score that corresponds to an area of 0.9050 under the standard normal curve to the left of the Z-score. When searching the array of values, the closest one we see is .9049. This value is in the row of 1.3 and the column of .01. Thus, the Z-score is approximately 1.31.
- 201. Table 2. Standard Normal Cumulative Probability Table
- 202. Exercises 1. Use Tables to find the following areas under the standard normal curve. 1. The area that lies to the left of Z = -0.58. 2. The area that lies between Z = -1.16 and Z = 2.71. 3. The area that lies to the right of Z = 0.31.
- 203. Exercises 2. 1. Find the Z-score so that the area to the left of the Z-score is 0.10. 2. Find the Z-score so that the area to the right of the Z-score is 0.0735.
- 204. We are often interested in finding the Z-score that has a specified area to the right. For this reason, we have special notation to represent this situation
- 205. The notation $z_\alpha$ (pronounced "z sub alpha") is the Z-score such that the area under the standard normal curve to the right of $z_\alpha$ is $\alpha$. Find the value of $z_{0.05}$.
- 206. This means that the area under the curve to the right of the z score is 0.05 and we need to find the corresponding z value. Since our tables give areas to the left of z scores, let’s find the area under the curve to the left of the z score, i.e. 1-0.05=0.95.
- 207. Now let’s find the z score corresponding to an area of 0.95. From the tables, 0.95 falls between the entries for z = 1.64 and z = 1.65, so the corresponding z value is approximately 1.65 (more precisely, 1.645).
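Working backwards from an area to a Z-score corresponds to the inverse of the cumulative distribution function. A sketch using `NormalDist.inv_cdf` from the Python standard library; the function name `z_sub_alpha` is chosen here for illustration.

```python
from statistics import NormalDist


def z_sub_alpha(alpha):
    """Z-score with an area of alpha to its right under the standard normal curve."""
    return NormalDist().inv_cdf(1 - alpha)


z_005 = z_sub_alpha(0.05)
print(round(z_005, 3))
```

The computed value lies between the table entries 1.64 and 1.65, which is why printed tables usually report 1.645 or round it to 1.65.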
- 208. as a probability distribution curve Recall that the area under the standard normal distribution can be interpreted as either a probability or as the proportion of the population with the given characteristic. When interpreting the area under the standard normal curve as a probability, we use the following notation Notation for the Probability of a Standard Normal Random Variable P(a < Z < b) represents the probability that a standard normal random variable is between a and b
- 209. P(Z > a) represents the probability that a standard normal random variable is greater than a. P(Z < a) represents the probability that a standard normal random variable is less than a. Example 5: Let Z denote a sample of glucose amount in the blood of patients which follows a normal distribution with a mean of 0 and standard deviation of 1. a. Find P (Z > 2). b. Find P (Z ≤ 1.73).
- 210. Solution: a. Since μ=0 and σ=1, the value of 2 is actually z=2 standard deviations above the mean. Proceed down the first (z) column in the standard normal tables and read the area opposite z=2.0. This area, denoted by the symbol P(z), is P(2.0)=0.9772. But this is the probability to the left of the z score, so P(Z > 2)=1-0.9772=0.0228. b. For Z=1.73, the tables give P(1.73)=0.9582. Therefore P(Z ≤ 1.73)=0.9582.
- 211. Example 6: The achievement scores for a college entrance examination are normally distributed with mean 75 and standard deviation 10. What fraction of the scores lies between 80 and 90? Solution: The desired fraction of the population is given by the area between $z_1 = \frac{80 - 75}{10} = 0.5$ and $z_2 = \frac{90 - 75}{10} = 1.5$.
- 212. P(0.5 < z < 1.5) = P(1.5) - P(0.5) = 0.9332 - 0.6915 = 0.2417 (equivalently, the difference of the upper-tail areas, 0.3085 - 0.0668). Therefore the fraction of the scores lying between 80 and 90 is 0.2417.
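Examples 5 and 6 can be verified with the cumulative distribution function. A minimal sketch with `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist()                       # standard normal: mean 0, sd 1
p_greater_2 = 1 - z.cdf(2)             # Example 5a: P(Z > 2)
p_at_most_173 = z.cdf(1.73)            # Example 5b: P(Z <= 1.73)

scores = NormalDist(mu=75, sigma=10)   # Example 6: entrance examination scores
p_between = scores.cdf(90) - scores.cdf(80)

print(round(p_greater_2, 4), round(p_at_most_173, 4), round(p_between, 4))
```

Subtracting the smaller cumulative area from the larger one gives the area between the two limits.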
- 213. Exercises 3. Let Z denote a normal random variable with mean 0 and standard deviation 1. Find P(−2 ≤ Z ≤ 2). The grade point averages (GPAs) of a large population of Public Health College students are approximately normally distributed with mean 2.4 and standard deviation 0.8. If students possessing a GPA less than 1.9 are dropped from college, what percentage of the students will be dropped?
- 214. The weekly amount of money spent on cleaning the city was observed, over a long period of time, to be approximately normally distributed with mean $400 and standard deviation $20. How much should be budgeted for weekly cleaning to provide that the probability the budgeted amount will be exceeded in a given week is only 0.1?
- 215. Suppose a clinically accepted value for mean systolic blood pressure in males aged 20 to 24 years is 120 mmHg and the standard deviation is 20 mmHg. a). If a 22 year old male is selected at random from the population, what is the probability that his systolic blood pressure is equal to
- 216. INFERENTIAL STATISTICS (Sampling and Estimation)
- 217. Statistical inference is the estimation of population parameters, such as the population mean or the population proportion, from the analysis of a sample drawn from that population. A sample is a small part of the population that is analysed as an example of the character, features or qualities of the population.
- 218. Sampling is the process of selecting a sample of people or products from a population which is to be used as a representative of the population of interest. An estimate is an approximate calculation of something and estimation is the process of coming up with an estimate of a population parameter. There are several sampling methods which are important for you to know in order to appreciate the process of sampling and estimation, the methods are briefly described below and you should take time to read around them from other
- 219. Sampling Methods: probability sampling and non-probability sampling. In a probability sample every member of the wider population has an equal chance of being included in the sample; inclusion or exclusion from the sample is a matter of chance and nothing else. In a non-probability sample some members of the wider population definitely will be excluded and others definitely included (i.e. every member of the wider population does not have an equal chance of being included in the sample).
- 220. Types of Probability Sample 1. Simple random sampling Each member of the population under study has an equal chance of being selected and the probability of a member of the population being selected is unaffected by the selection of other members of the population. One problem associated with this particular sampling method is that a complete list of the population is needed and this is not always readily available
- 221. 2. Systematic Sampling It involves selecting subjects from a population list in a systematic rather than a random fashion. For example, if from a population of, say, 2,000, a sample of 100 is required, then every twentieth person can be selected. The starting point for the selection is chosen at random.
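The systematic sampling example (100 people from a list of 2,000) can be sketched as follows. The function name `systematic_sample` and the numbered population list are assumptions made for illustration.

```python
import random


def systematic_sample(population, sample_size, rng=random):
    """Pick every k-th member of the list, starting from a random position."""
    step = len(population) // sample_size   # sampling interval k (20 in the example)
    start = rng.randrange(step)             # random starting point within the first k
    return population[start::step][:sample_size]


people = list(range(1, 2001))               # hypothetical population list of 2,000
sample = systematic_sample(people, 100, random.Random(0))
print(len(sample), sample[:5])
```

Only the starting point is random; every later selection is fixed by the interval, which is why the method needs a population list that has no periodic pattern matching the interval.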
- 222. 3. Stratified random sample Stratified sampling involves dividing the population into homogeneous groups, each group containing subjects with similar characteristics. A stratified random sample is, therefore, a useful blend of randomization and categorization, thereby enabling both a quantitative and qualitative piece of research to be undertaken.
- 223. 4. Cluster sampling It involves the sampling of successively smaller units. Conditions for doing cluster sampling: 1. The sampling frame cannot be identified. 2. Direct contact needs to be made with the sample units, but these are scattered around a wide geographical area.
- 224. Cluster sampling is an example of 'two-stage sampling' or 'multistage sampling': in the first stage a sample of areas is chosen; in the second stage a sample of respondents within those areas is selected.
- 225. Multistage sampling Multistage sampling is a complex form of cluster sampling in which two or more levels of units are embedded one in the other. The first stage consists of constructing the clusters that will be used to sample from. In the second stage, a sample of primary units is randomly selected from each cluster (rather than using all units contained in all selected clusters). In following stages, in each of those selected clusters, additional samples of units are selected, and so on.
- 226. All ultimate units (individuals, for instance) selected at the last step of this procedure are then surveyed. This technique, thus, is essentially the process of taking random samples of preceding random samples.
- 227. Non probability samples 1. Convenience (Accidental/Opportunity) Sampling It involves choosing the nearest individuals to serve as respondents and continuing that process until the required sample size has been obtained. The researcher simply chooses the sample from those to whom she has easy access. As it does not represent any group apart from itself, it does not seek to generalize about the wider population.
- 228. 2. Quota Sampling A quota sample strives to represent significant characteristics (strata) of the wider population and it sets out to represent these in the proportions in which they can be found in the wider population. For example, suppose that the wider population (however defined) were composed of 55% females and 45% males, then the sample would have to contain 55% females and 45% males
- 229. 3. Purposive Sampling In purposive sampling, researchers handpick the cases to be included in the sample on the basis of their judgement of their typicality. In this way, they build up a sample that is satisfactory to their specific needs. Assumptions for one to use purposive sampling: 1. They possess the necessary knowledge 2. They have relevant experience 3. They are part of the social structure or process on which the research is intended to focus
- 230. 4. Snowball Sampling Researchers identify a small number of individuals who have the characteristics in which they are interested. These people are then used as informants to identify, or put the researchers in touch with, others who qualify for inclusion and these, in turn, identify yet others. This method is useful for sampling a population where access is difficult, maybe because it is a sensitive topic or where communication networks are undeveloped.
- 231. “What sample size do I need?” The answer to this question is influenced by a number of factors, including: the purpose of the study, the population size, the risk of selecting a “bad” sample and the allowable sampling error; the data analysis plan, e.g. the number of cells one will have in a cross tabulation; and, most of all, whether one is undertaking a qualitative or quantitative study.
- 232. Sample size determination in qualitative study: Probability sampling is not appropriate, as the sample is not intended to be statistically representative. However, the sample should be able to represent salient characteristics in the population. Sampling continues until the point of theoretical saturation.
- 233. Sample size is usually small to allow in-depth exploration and understanding of the phenomena under investigation. Ultimately it is a matter of judgement: expertise is necessary in evaluating the quality of information against its final use, the research methodology, the sampling strategy and the results. In practice, qualitative sampling usually requires a flexible, pragmatic approach.
- 234. The researcher actively selects the most productive sample to answer the research question. This can involve developing a framework of the variables that might influence an individual's contribution and will be based on the researcher's practical knowledge of the research area, the available literature and evidence from the study itself. This is a more intellectual strategy than the simple demographic stratification of epidemiological studies, though age, gender and social class might be important variables.
- 235. If the subjects are known to the researcher, they may be stratified according to known public attitudes or beliefs. It may be advantageous to study: • a broad range of subjects (maximum variation sample) • outliers (deviant sample) • subjects who have specific experiences (critical case sample) • subjects with special expertise (key informant sample).
- 236. The iterative process of qualitative study design means that samples are usually theory driven (theoretical sampling) to a greater or lesser extent.
- 237. Some suggestions of sample size in qualitative studies: The smallest number of participants should be 15. The number should lie under 50. Use 6-8 participants for FGDs AND at least 2 FGDs per population group. IMPORTANT: attainment of saturation; justification of the choice of number.
- 238. Sample size determination in quantitative study: Several criteria will need to be specified to determine the appropriate sample size: level of precision; level of confidence or risk; degree of variability in the attributes being measured (prevalence); external validity.
- 239. The level of precision, sometimes called sampling error, is the range in which the true value of the population is estimated to be. This range is often expressed in percentage points (e.g., ±5 percent). The confidence level is based on ideas encompassed under the Central Limit Theorem. For example, if a 95% confidence level is selected, 95 out of 100 samples will have the true population value within the range of precision.
- 240. Degree of variability refers to the distribution of attributes in the population. The more heterogeneous a population, the larger the sample size required to obtain a given level of precision. The less variable (more homogeneous) a population, the smaller the sample size.
- 241. A proportion of 50% indicates a greater level of variability than either 20% or 80%. This is because 20% and 80% indicate that a large majority do not or do, respectively, have the attribute of interest. Because a proportion of 0.5 indicates the maximum variability in a population, it is often used in determining a more conservative sample size, that is, the sample size may be larger than if the true variability of the population attribute were used.
- 242. Sample size affects accuracy of representation: a larger sample means less chance of error. The minimum suggested sample is 30 and the upper limit is 1,000. External validity is how well the sample generalizes to the population; a representative sample is required (not the same thing as variety in a sample).
- 243. Strategies for Determining Sample Size There are several approaches to determining the sample size. Using a census for small populations Imitating a sample size of similar studies Using published tables Applying formulas to calculate a sample size
- 244. Using a Census for Small Populations: One approach is to use the entire population as the sample. Although cost considerations make this impossible for large populations, a census is attractive for small populations (e.g., 200 or less). It eliminates sampling error and provides data on all the individuals in the population. Some costs such as questionnaire design and developing the sampling frame are “fixed,” that is, they will be the same for samples of 50 or 200. Finally, virtually the entire population would have to be sampled in small populations to achieve a desirable level of precision.
- 245. Using a Sample Size of a Similar Study: Use the same sample size as those of studies similar to the one you plan (cite the reference). Without reviewing the procedures employed in these studies you may run the risk of repeating errors that were made in determining the sample size for another study. However, a review of the literature in your discipline can provide guidance about “typical” sample sizes that are used.
- 246. Using Published Tables: Published tables provide the sample size necessary for given combinations of precision, confidence level and variability. The sample sizes presume that the attributes being measured are distributed normally or nearly so. Although tables can provide a useful guide for determining the sample size, you may need to calculate the necessary sample size for a different combination of levels of precision, confidence, and variability.
- 247. Sample Size for ±5%, ±7% and ±10% Precision Levels where Confidence Level Is 95% and P=.5.

  | Size of Population | Sample size (n) for precision (e) of ±5% | ±7% | ±10% |
  |---|---|---|---|
  | 100 | 81 | 67 | 51 |
  | 125 | 96 | 78 | 56 |
  | 150 | 110 | 86 | 61 |
  | 175 | 122 | 94 | 64 |
  | 200 | 134 | 101 | 67 |
  | 225 | 144 | 107 | 70 |
  | 250 | 154 | 112 | 72 |
  | 275 | 163 | 117 | 74 |
  | 300 | 172 | 121 | 76 |
  | 325 | 180 | 125 | 77 |
  | 350 | 187 | 129 | 78 |
  | 375 | 194 | 132 | 80 |
  | 400 | 201 | 135 | 81 |
  | 425 | 207 | 138 | 82 |
  | 450 | 212 | 140 | 82 |
- 248. Using Formulas to Calculate a Sample Size: Sample size can be determined by the application of one of several mathematical formulae. The formulae are mostly used for calculating a sample size for proportions. For example, for populations that are large, the Cochran (1963:75) equation yields a representative sample for proportions. Others include the Fisher equation, Mugenda etc.
- 249. Cochran equation: $n_0 = \frac{Z_{\alpha/2}^2\, p q}{e^2}$, where $n_0$ is the sample size; $Z_{\alpha/2}$ is the abscissa of the normal curve that cuts off an area α at the tails (1 – α equals the desired confidence level, e.g., 95%); e is the desired level of precision; p is the estimated proportion of an attribute that is present in the population; and q is 1-p. The value for Z is found in statistical tables which contain the area under the normal curve, e.g. Z = 1.96 for 95% level of confidence.
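The Cochran calculation can be sketched in a few lines. The sketch assumes the usual form of the equation, n0 = Z² p q / e², with Z taken as the two-sided critical value for the chosen confidence level; the function name `cochran_n0` is illustrative.

```python
from math import ceil
from statistics import NormalDist


def cochran_n0(confidence, p, e):
    """Sample size for a proportion in a large population (assumed Cochran form)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95%
    q = 1 - p
    return z**2 * p * q / e**2


n0 = cochran_n0(confidence=0.95, p=0.5, e=0.05)
print(round(n0, 1), ceil(n0))   # round up to a whole number of respondents
```

Using p = 0.5 gives the most conservative (largest) sample size for a given precision and confidence level.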
- 250. A Simplified Formula For Proportions: Yamane (1967:886) provides a simplified formula to calculate sample sizes. ASSUMPTION: 95% confidence level and P = .5.
- 251. $n = \frac{N}{1 + N e^2}$, where n is the sample size, N is the population size, and e is the level of precision.
- 252. Finite population correction for proportions: With finite populations, a correction for proportions is necessary. If the population is small then the sample size can be reduced slightly. This is because a given sample size provides proportionately more information for a small population than for a large population. The sample size ($n_0$) can thus be adjusted using the corrected formula.
- 253. $n = \frac{n_0}{1 + \frac{n_0 - 1}{N}}$, where n is the sample size, N is the population size and $n_0$ is the calculated sample size for an infinite population.
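The simplified formula and the finite population correction can be sketched together. The equations on these slides did not survive extraction, so the code assumes the forms commonly given for them: n = N / (1 + N e²) for Yamane and n = n0 / (1 + (n0 - 1) / N) for the correction. The function names and the population of 2,000 are illustrative.

```python
def yamane_n(N, e):
    """Simplified sample size for proportions (assumed Yamane form)."""
    return N / (1 + N * e**2)


def finite_population_correction(n0, N):
    """Adjust a large-population sample size n0 for a finite population of size N."""
    return n0 / (1 + (n0 - 1) / N)


N = 2000                                         # hypothetical population size
n_yamane = yamane_n(N, e=0.05)
n_corrected = finite_population_correction(385, N)   # 385 from Cochran at ±5%, 95%
print(round(n_yamane, 1), round(n_corrected, 1))
```

Both results are smaller than the large-population figure of 385, which illustrates how a finite population reduces the sample needed for the same precision.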
- 254. Note The sample size formulae provide the number of responses that need to be obtained. Many researchers commonly add 10 % to the sample size to compensate for persons that the researcher is unable to contact. The sample size also is often increased by 30 % to compensate for non-response ( e.g self administered questionnaires).
- 255. Use of software in sample size determination Depending on type of study and specific software Some information will be required: Population sample size, population standard deviation, population sampling error, confidence level, z –value, power of study etc … 80% power in a clinical trial means that the study has a 80% chance of ending up with a p value of less than 5% in a statistical test (i.e. a statistically significant treatment effect) if there really was an important difference (e.g. 10% versus 5% mortality) between treatments.
- 256. Further considerations The above approaches to determining sample size have assumed that a simple random sample is the sampling design. For more complex designs, e.g. case-control studies, one must take into account the variances of sub-populations, strata, or clusters before an estimate of the variability in the population as a whole can be made.
- 257. Estimation Inferential statistics is the estimation of the population parameters from the sample statistics. The sample statistics are calculated from the sample data and the population parameters are inferred (or estimated) from the sample statistics. In estimation, we are concerned with unknown population parameters such as a population mean which is unknown but is required
- 258. Such situations force us to take samples, find sample statistics and use them to make inferences about the unknown population parameters. We can estimate an unknown population parameter in two main ways: (i) By calculating a point estimate from the samples. A point estimate is a single value from the sample, such as the sample mean $\bar{x}$, used to estimate an unknown population parameter such as the population mean µ.
- 259. (ii) You can also calculate an interval estimate which is a range within which the unknown population parameter is expected to fall. Whether we find a point estimate or an interval estimate, in both cases, we are trying to find or estimate the value of an unknown population parameter. The estimator so found must satisfy three conditions:
- 260. (i) It must be unbiased: The expected value of the estimator must be equal to the population parameter, (ii) Consistent: The value of the estimator approaches the value of the parameter as the sample size increases, (iii) Relatively Efficient: The estimator has the smallest variance of all estimators which could be used.
- 261. Estimating a population mean Consider a population whose mean µ is unknown, as illustrated by the figure below.
- 262. Note that the large area is the population while the smaller areas are samples taken from the population. In order to estimate µ, we will need to take samples from the population and calculate the sample means. Each sample mean, $\bar{x}$, is trying to estimate µ individually. However, note that the best estimate of µ is the mean of the sample means, called the mean of the sampling distribution of means, $\mu_{\bar{x}}$.
- 263. $\mu_{\bar{x}} = \frac{\bar{x}_1 + \bar{x}_2 + \bar{x}_3 + \bar{x}_4 + \cdots + \bar{x}_n}{n}$ For large n, $\mu_{\bar{x}} = \mu$ (the mean of the sampling distribution of means is equal to the population mean)
- 264. $\bar{x}$ is a point estimate of the population mean μ. It is called a consistent estimator because its value gets closer to the population mean μ as the sample size n increases. Irrespective of the number of samples under consideration, a point estimate is likely to be different from its corresponding population parameter. It is for this reason that interval estimates are preferred to point estimates.
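Consistency can be illustrated with a small simulation. This is only a sketch: the population (normal with mean 50 and standard deviation 10), the seed, and the sample sizes are all assumptions made for the demonstration.

```python
import random
import statistics

def sample_mean(n, mu=50.0, sigma=10.0, rng=None):
    """Draw n values from an assumed normal population and return their mean."""
    rng = rng or random.Random()
    return statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))

rng = random.Random(2024)  # fixed seed so the run is repeatable
for n in (10, 100, 1_000, 10_000):
    x_bar = sample_mean(n, rng=rng)
    print(f"n={n:>6}  sample mean={x_bar:.3f}  error={abs(x_bar - 50.0):.3f}")
```

The error does not shrink at every single step, because each sample is random, but it tends towards zero as n grows.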
- 265. An interval estimate is a range within which the true value is expected to fall. You may be asked to estimate the day when the first rains will fall this year. A point estimate would be to say the first rains will fall on 12th October. This single-date estimate is unlikely to be correct; giving a range within which the first rains will fall would be a better estimate.
- 266. The wider the range, the more the confidence that indeed the first rains will fall within that period. I can say, for example that the first rains will fall between 1st October and 31st January. From experience, first rains always fall after 1st October and way before 31st January. I can therefore say that I am 100% confident that the first rains will fall between 1st October and 31st January.
- 267. You will also agree with this level of confidence because you know very well that the first rains fall well after 1st October and well before 31st January. If we are 100% confident (probability = 1) then we can represent this on a normal curve.
- 268. Figure showing confidence level and Limits
- 269. The level of confidence is called the Confidence Level (100%) while the dates 1st October and 31st January are called the Confidence Limits. If I shift the confidence limits and ask what your level of confidence is that the first rains will come between 1st November and 31st December, you may not be 100% confident. The level of confidence drops because you know there are many years when the first rains have come in October.
- 270. You also know that first rains have sometimes come as late as after Christmas. Your level of confidence may therefore be, say, 96%. The 4% is the likelihood that you are wrong, that the first rains could come before 1st November or after 31st December.
- 271. Figure showing 96% confidence level, confidence limits and confidence interval for the day the first rains will fall in Malawi.
- 272. The above approach is also true for any unknown population parameter such as the population mean μ. A confidence interval is an interval estimate with a specific level of confidence. A level of confidence is the probability that the interval estimate will contain the parameter. In other words, it is the percentage of intervals, constructed in the same way from repeated samples, that would contain the true mean.
- 273. The confidence interval is therefore a range within which an unknown population parameter is expected to fall. The confidence limits are the end values of this range, for which the level of confidence is declared.
- 274. For the estimation of μ, the sample means will be different from each other and also from their mean $\mu_{\bar{x}}$. As a result, the sample means will have a standard deviation about their mean. This standard deviation is called the standard error (SE). The Central Limit Theorem states that, irrespective of the distribution of the parent population (whether normal or not), the sampling distribution of means will be approximately normal when the samples are large (i.e. the sample means $\bar{x}_1, \bar{x}_2, \bar{x}_3, \bar{x}_4, \ldots, \bar{x}_n$ are normally distributed about their mean $\mu_{\bar{x}}$ with standard deviation $\sigma_{\bar{x}}$,
- 275. known as the standard error). Figure showing sample means normally distributed.
- 276. This means we can use the normal distribution tables to determine the probability of a sample mean taking any value of interest, provided we know the mean of the distribution (the mean of sample means, $\mu_{\bar{x}}$) and the standard deviation of the distribution, $\sigma_{\bar{x}}$.
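The behaviour of sample means can be checked by simulation. The sketch below assumes a deliberately skewed parent population (exponential with mean 1 and standard deviation 1), 2000 samples of size 50, and a fixed seed; none of these values come from the slides.

```python
import math
import random
import statistics

def simulate_sample_means(num_samples, n, rng):
    """Return the means of num_samples samples of size n from an exponential population."""
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(num_samples)
    ]

rng = random.Random(7)
n = 50
means = simulate_sample_means(num_samples=2_000, n=n, rng=rng)

mean_of_means = statistics.fmean(means)   # should be close to the population mean, 1
observed_se = statistics.stdev(means)     # standard deviation of the sample means
theoretical_se = 1.0 / math.sqrt(n)       # sigma / sqrt(n) with sigma = 1

print(f"mean of sample means: {mean_of_means:.3f}")
print(f"observed SE: {observed_se:.3f}, sigma/sqrt(n): {theoretical_se:.3f}")
```

Although the parent population is far from normal, the sample means cluster symmetrically around the population mean with a spread close to σ/√n.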
- 277. The standard error of the mean (SEM) is the standard deviation of the sampling distribution of means. It can also be viewed as the standard deviation of the error in the sample mean relative to the true mean, since the sample mean is an unbiased estimator of μ. SEM is usually estimated by the population standard deviation divided by the square root of the sample size: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ or $SE_{\bar{x}} = \frac{SD}{\sqrt{n}}$
- 278. Where; σ is the standard deviation of the population. n is the size (number of observations) of the sample.
- 279. Figure showing the 95% confidence interval estimate for μ.
- 280. We are 95% confident that the mean of the population from which the sample was taken falls somewhere between X1 and X2. Any confidence interval is given with a level of confidence, expressed as a percentage. You can have a 95% confidence interval or a 99% confidence interval or indeed any level of confidence. The level of confidence determines the number of standard deviations (Z) that any sample mean value is from the mean $\mu_{\bar{x}}$. The value of Z is obtained from tables. For a 95% level of confidence, the area at the centre is 0.95.
- 281. You need to search for $\frac{5\%}{2} = 0.025$ inside the tables to be able to read the corresponding Z value. The value of Z for area = 0.025 is ±1.96. The confidence interval is from X1 to X2; the 95% confidence interval is shown below.
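- 283. The position X1 is 1.96 standard deviations (standard errors) less than the mean (estimated by the sample mean $\bar{x}$), i.e. $X_1 = \bar{x} - 1.96\,\sigma_{\bar{x}}$. Similarly, the position X2 on Figure 3.6 is 1.96 standard errors more than the mean. This therefore means that the 95% confidence interval is $\bar{x} - 1.96\,\sigma_{\bar{x}}$ to $\bar{x} + 1.96\,\sigma_{\bar{x}}$. Since we know that $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$, the 95% confidence interval for μ is $\bar{x} - 1.96\frac{\sigma}{\sqrt{n}}$ to $\bar{x} + 1.96\frac{\sigma}{\sqrt{n}}$.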
- 284. The 95% confidence interval for μ is therefore $\bar{x} \pm 1.96\frac{\sigma}{\sqrt{n}}$. Similarly, we can find the 99%, 98%, 90% etc. confidence intervals for the population mean given data from samples taken from that population.
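Instead of reading Z from printed tables, the value for any confidence level can be computed. This sketch uses `statistics.NormalDist` from the Python standard library; the function names and the example inputs (mean 100, σ = 15, n = 36) are assumptions for illustration.

```python
import math
from statistics import NormalDist

def z_value(confidence: float) -> float:
    """Two-sided Z value: leaves (1 - confidence) / 2 in each tail of the normal curve."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

def mean_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """Interval x_bar +/- z * sigma / sqrt(n) for the population mean."""
    margin = z_value(confidence) * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin

for level in (0.90, 0.95, 0.98, 0.99):
    low, high = mean_confidence_interval(x_bar=100.0, sigma=15.0, n=36, confidence=level)
    print(f"{level:.0%}: z = {z_value(level):.3f}, interval = {low:.2f} to {high:.2f}")
```

Higher confidence levels give larger Z values and therefore wider intervals, which matches the rains example: the wider the range, the greater the confidence.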
- 285. Example As part of a malaria control programme it was planned to spray all 10 000 houses in a rural area with insecticide, and it was necessary to estimate the amount that would be required. Since it was not feasible to measure all houses, a random sample of 100 houses was chosen and the sprayable surface of each of these was measured.
- 286. The mean sprayable surface area for these 100 houses was 23.2 m² and the standard deviation was 5.9 m². (a) Calculate the standard error of the estimate of the population mean. (b) What is a standard error? (c) The 95% confidence interval of the population mean was 22.0 m² to 24.4 m²; what is a confidence interval? (d) What is the difference between a standard error and a standard deviation?
- 287. SOLUTION (a) The standard error of the estimate of the population mean µ: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{5.9}{\sqrt{100}} = 0.59\ \text{m}^2$ (b) A standard error is the standard deviation of the sampling distribution of means. It is related to the population standard deviation in this way: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ (c) A confidence interval is a range within which an unknown population parameter is expected to fall.
- 288. (d) A standard error is the standard deviation of the sampling distribution of means, i.e. a measure of how far, on average, the sample means are from their mean, while a standard deviation is a measure of how far, on average, the values in a set of data are from their mean.
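The figures in the spraying example can be reproduced in a few lines. This sketch uses the numbers given in the example and, as the solution does, takes the sample standard deviation in place of σ; the variable names are assumptions.

```python
import math

n = 100        # houses sampled
x_bar = 23.2   # mean sprayable surface of the sample, in m^2
sd = 5.9       # standard deviation of the sample, in m^2
z = 1.96       # table value for a 95% level of confidence

standard_error = sd / math.sqrt(n)
lower = x_bar - z * standard_error
upper = x_bar + z * standard_error

print(f"SE = {standard_error:.2f}")            # SE = 0.59
print(f"95% CI: {lower:.1f} to {upper:.1f}")   # 95% CI: 22.0 to 24.4
```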
- 289. Estimating a population proportion One of the population parameters that need to be estimated is the population proportion ρ. If a population proportion such as the prevalence of a disease in the entire population is unknown, it may be estimated through sampling the population as discussed. The sample statistics are the best estimates of the unknown population proportion. The population proportion ρ can be estimated from the sample proportion p.
- 290. The 95% confidence interval for the population proportion ρ is given by;
- 291. 95% Confidence interval for ρ $= p \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}$ Note that SE(ρ) is given by $\sqrt{\frac{p(1-p)}{n}}$
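A minimal sketch of the interval $p \pm z_{\alpha/2}\sqrt{p(1-p)/n}$ as a reusable function. The function name and the example counts (40 out of 200) are assumptions used only to show the calculation.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(successes: int, n: int, confidence: float = 0.95):
    """Return (p, lower, upper) using p +/- z * sqrt(p * (1 - p) / n)."""
    p = successes / n
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    standard_error = math.sqrt(p * (1.0 - p) / n)
    return p, p - z * standard_error, p + z * standard_error

# Assumed example: 40 of 200 sampled people have the attribute of interest.
p, low, high = proportion_confidence_interval(40, 200)
print(f"p = {p:.3f}, 95% CI = {low:.3f} to {high:.3f}")
```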
- 292. Example A health survey was carried out in Mangochi urban in 2014 among 123 adults chosen at random. The survey, among other things, asked respondents when they last visited a sing’anga (an African medicine man). The answers revealed that 34 of them had not visited a sing’anga for over 2 years. (a) Calculate an estimate of the proportion of adults who had not visited a sing’anga for over two years.
- 293. (b) Find the 95% confidence interval for the proportion of adults who had not visited a sing’anga. What is the meaning of this confidence interval to you? (c) If a narrower confidence interval for this proportion was required, what would you recommend to the researchers? (d) What percentage of adults in Mangochi had visited a sing’anga in the past 2 years? Calculate the 98% confidence interval about this proportion.
- 294. Solutions a) Proportion of adults who had not visited a sing’anga for the past two years: $p = \frac{34}{123} = 0.2764$ or 27.64%. b) Find the 95% confidence interval for the proportion of adults who had not visited a sing’anga. What is the meaning of this confidence interval to you?
- 295. The 95% confidence interval for ρ $= p \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} = 0.2764 \pm 1.96\sqrt{\frac{0.2764(1-0.2764)}{123}} = 0.1974$ to $0.3554$
- 296. It means that we have observed from the sample that 27.64% of adults did not visit a sing’anga for the past two years, but if we were to deal with the whole population, the proportion of adults who would not have visited a sing’anga would be somewhere between 19.74% and 35.54%. (c) If a narrower confidence interval of this proportion was required, what would you recommend to the researchers? In order to narrow the confidence interval (i.e. obtain a more precise estimate) you need to increase the sample size. (d) What percentage of adults in Mangochi had
- 297. The percentage of adults who had visited a sing’anga
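The Mangochi example can be worked numerically as a check. This is a sketch under stated assumptions: it uses the unrounded proportion 34/123, takes the adults who had visited a sing’anga as the remaining 123 - 34 = 89 respondents, and obtains the 98% Z value from `statistics.NormalDist` rather than from tables. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def proportion_ci(successes: int, n: int, confidence: float):
    """Return (p, lower, upper) using p +/- z * sqrt(p * (1 - p) / n)."""
    p = successes / n
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    margin = z * math.sqrt(p * (1.0 - p) / n)
    return p, p - margin, p + margin

n = 123
not_visited = 34
visited = n - not_visited  # assumption: everyone else visited within the past 2 years

# Parts (a) and (b): proportion who had not visited, with a 95% interval.
p_not, low95, high95 = proportion_ci(not_visited, n, 0.95)
print(f"not visited: p = {p_not:.4f}, 95% CI = {low95:.3f} to {high95:.3f}")

# Part (d): proportion who had visited, with a 98% interval.
p_vis, low98, high98 = proportion_ci(visited, n, 0.98)
print(f"visited:     p = {p_vis:.4f}, 98% CI = {low98:.3f} to {high98:.3f}")
```

Because the code does not round p to four decimals before computing the interval, its limits can differ from a hand calculation in the last digit.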
- 298. DATA COLLECTION AND MANAGEMENT Data collection is a major part of the research process. Methods and instruments for data collection must be chosen according to the nature of the problem, approach to the solution and variables being studied
- 299. Qualitative Data Collection Methods 1. Collecting verbal data Verbal data primarily consist of words, gathered through various methodological approaches, in which research participants speak about events, experiences, practices, and so on. This is achieved through interviews, focus group discussions and narratives.
- 300. The three main methods of data collection 1. In-depth interviews (IDIs) Interviewing is often used in qualitative studies to elicit meaningful data. In interviews, the interviewer writes down responses verbatim or uses a tape-recorder for later transcription.
- 301. IDIs in qualitative research encourage subjects to express their views at length. The respondent is usually interviewed at a place convenient to them. An interview schedule, sometimes called an interview guide, is a list of topics administered to subjects by a skilled interviewer.
- 302. The researcher may be able to obtain more detailed information from each participant, but loses the richness that can arise in a group (FGD) in which people debate issues and exchange views. Example: Please describe your experiences on the day you were discharged from the hospital.
- 303. The interview helps reveal more about the beliefs, attitudes and behaviour of the respondent. IDIs normally use open-ended questions which permit free responses, which should be recorded in the respondents’ own words. Such questions are useful for obtaining in-depth information on:
- 304. 1. Facts with which the researcher is not very familiar, 2. Opinions, attitudes and suggestions of informants, 3. Sensitive issues.
- 305. In order to have quality data with open-ended questions there is a need to: 1. Thoroughly train and supervise the interviewers or select experienced research assistants. 2. Prepare a list of further questions to keep at hand to use to ‘probe’ for answer(s) in a systematic way.
- 306. 3. Pre-test open-ended questions and, if possible, pre-categorise the most common responses, leaving enough space for other answers.
- 307. 2. Semi-structured interviews Semi-Structured Interviews allow participants to provide specific answers to questions in their own words. When open-ended questions are included in the data collection tool, respondents must write out their responses. The focus of the interview is decided by the researcher and there may be areas the researcher is interested in exploring.
- 308. The researcher tries to build a rapport with the respondent and the interview is like a conversation.
- 309. 3. Focus group discussions (FGDs) For this method the researcher brings together a small number of subjects, usually between 6 and 12, to discuss the topic of interest. The group size is kept deliberately small, so that its members do not feel intimidated but can express opinions freely. The small number of participants also makes discussion manageable by the
- 310. However, too few participants may result in an inadequate discussion and too many may lead to social loafing by some. A focus group questionnaire is called a "discussion guide", and is more of a checklist of questions than a fully structured questionnaire. This is because the trick with focus groups is to put the group firmly in
- 311. The use of purposive sampling is most often employed when individuals known to have a desired expertise are sought.
- 312. Direct observation Data can be collected by an external observer, referred to as a non-participant observer. Alternatively, the data can be collected by a participant observer, who can be a member of staff undertaking usual duties while observing the processes of care.
- 313. In this type of study the researcher aims to become immersed in or become part of the population being studied, so that they can develop a detailed understanding of the values and beliefs held by members of the population. Sometimes a list of observations the researcher is specifically looking for is prepared beforehand; at other times the observer makes notes about anything they observe for analysis later.
- 314. Quantitative Data Collection Method 1. Questionnaires A questionnaire is an instrument with closed questions or statements to which a respondent must react. Closed-ended questionnaires ask subjects to select an answer from among several choices. The alternatives may range from a simple ‘yes’ or ‘no’ to complex expressions of opinion.
- 315. Examples 1. Have you been hospitalized as an inpatient at any time in the past 5 years? a. Yes b. No
- 316. 2. How important is it to you to avoid a pregnancy at this time? a. Extremely important b. Very important c. Somewhat important d. Not important
- 317. Scales A scale is a set of numerical values assigned to responses, representing the degree to which subjects possess a particular attitude, value or characteristic. Likert Scales
- 318. Likert scales, also called summative scales, require subjects to respond to a series of statements to express a viewpoint. Subjects read each statement and select an appropriately ranked response. Response choices commonly address agreement, evaluation, or frequency.
- 319. Likert’s original scale included five agreement categories: “strongly agree (SA)”, “agree (A)”, “uncertain (U)”, “disagree (D)” and “strongly disagree (SD)”. The number of categories in the Likert scale can be modified: it can be extended to seven categories (by adding “somewhat disagree” and “somewhat agree”) or reduced to four categories (by eliminating “uncertain”).
- 320. For example: What is your opinion on the following statement? ‘Women who have induced abortion should be severely punished.’
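When Likert responses are entered for analysis, each category is usually given a number and the item scores are summed for each respondent. The sketch below is one possible coding, not a prescribed one: the 5-to-1 scoring, the reverse-scoring option for negatively worded statements, and the sample responses are all assumptions.

```python
# Assumed coding for the five agreement categories (SA = 5 down to SD = 1).
LIKERT_SCORES = {"SA": 5, "A": 4, "U": 3, "D": 2, "SD": 1}

def score_responses(responses, reverse=False):
    """Convert category codes to numbers; reverse=True flips the scale (5 becomes 1)."""
    scores = [LIKERT_SCORES[code] for code in responses]
    if reverse:
        scores = [6 - score for score in scores]
    return scores

def summative_score(responses):
    """Total score for one respondent across all statements."""
    return sum(score_responses(responses))

# Hypothetical answers from one respondent to four statements.
answers = ["SA", "A", "U", "D"]
print(score_responses(answers))
print(summative_score(answers))
```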
- 321. Data Management Data management consists of those activities aimed at achieving a systematic, coherent manner of data collection, storage and retrieval. How data are stored and retrieved is at the heart of data management.
- 322. A good storage and retrieval system is critical for keeping track of what data are available, for permitting easy, flexible, reliable use of data and for documenting the analysis made so that the study can, in principle, be verified or replicated. A system for storage and retrieval should be designed prior to the actual data collection.
- 323. In data management, you may consider some of the following points or questions: The principal investigator is responsible for ensuring that data are of high quality by, for example, completely checking a subset of all completed interviews. Data organization: How will you name your data files? How will you organize your data into folders?
- 324. Access & security: Who will have access to your data? If the data is sensitive, how will you protect it from unauthorized access? Storage: Where will your data be stored?
- 325. Backups: This is probably the single most important item on this list. Hard drives on desktop and laptop computers fail regularly. You must have a credible backup strategy of regular backups, and of course you must then follow it. Consider including an off-site backup so that your data will not be lost if your building burns down or if your computer is stolen. Rather than relying on memory, consider an automated backup
- 326. A large amount of qualitative data can be stored on computers using a variety of available computer applications. Therefore, gaining as much knowledge as possible about computer programs is critical. It is recommended that original data be preserved for not less than a period of 5 years, as there is reasonable expectation that the original data will continue to be the basis of ongoing
