- 1. Module 1.1: Introduction What is statistics? What is Biostatistics? Why do we study Biostatistics?
- 2. • Statistics is a field of study concerned with: 1. the collection, organization, summarization, and analysis of data; and 2. the drawing of inferences about a body of data when only a part of the data is observed. • Biostatistics: When the tools of statistics are employed on the data derived from the biological sciences and medicine or public health, we use the term biostatistics 2
- 3. • Statistics versus statistic (field of study versus numerical quantity computed from sample data) • Roughly speaking, the field of statistics can be divided into: • Mathematical Statistics: the study & development of statistical theory and methods in the abstract and • Applied Statistics: the application of statistical methods to solve real problems involving randomly generated data, and the development of new statistical methodology motivated by real problems 3
- 4. Rationale of studying Statistics • Statistics provides a way of organizing information on a wider and more formal basis than relying on the exchange of anecdotes or biography and personal experiences • More and more things are now measured quantitatively in medicine and public health • There is a great deal of intrinsic (inherent) variation in most biological processes
- 5. Rationale of studying Statistics • The medical and public health literature is replete with reports in which statistical techniques are used extensively • The planning, conduct and interpretation of much of medical and public health research are becoming increasingly reliant on statistical technology
- 6. Limitations of statistics • It deals only with those subjects of inquiry that are capable of being quantitatively measured and numerically expressed. • It deals with aggregates of facts and attaches no importance to individual items: it is suited only when group characteristics are to be studied. • Statistical data are only approximately, not mathematically, correct.
- 7. Limitations of statistics • It can be used to establish wrong conclusions and should therefore be used only by experts. • Remember the three lies: lies, damned lies, and statistics • Evan Esar’s Definition of Statistics and Quote: “The science of producing unreliable facts from reliable figures” • “Statistics is the only science that enables different experts using the same figures to draw different conclusions”
- 8. Variable • A characteristic that takes on different values in different persons, places, or things is called a variable. The characteristic is not the same when observed in different possessors of it. • Quantitative variable: one that can be measured in the usual sense. For example, measurements of the heights of adults, the weights of children, and the ages of patients. • Qualitative variable: a characteristic that can only be categorized, such as possessing or not possessing some characteristic of interest, ethnic group, etc.
- 9. • Random Variable: Whenever we determine the height, weight, or age of an individual, the result is frequently referred to as a value of the respective variable. • When the values obtained arise as a result of chance factors, so that they cannot be exactly predicted in advance, the variable is called a random variable. • When a child is born, we cannot predict exactly his or her height at maturity. Attained adult height is the result of numerous genetic and environmental factors. 9
- 10. Scales of measurement • Scales of measurement refer to ways in which variables/numbers are defined and categorized. Each scale of measurement determines the appropriateness for use of certain statistical analyses. • There are four scales of measurement: nominal, ordinal, interval, and ratio. 10
- 11. Scales of measurement • Nominal: Categorical data and numbers that are simply used as identifiers or names represent a nominal scale of measurement. • Example: gender, coding Female as 1 and Male as 2 or vice versa • Ordinal: An ordinal scale of measurement represents an ordered series of relationships or rank order. • Example: Likert-type scales; "How much pain are you in today?" (on a scale of 1 to 10, with one being no pain and ten being high pain) represents ordinal data.
- 12. Scales of measurement • Interval: A scale which represents quantity and has equal units but for which zero represents simply an additional point of measurement is an interval scale. • In interval scales zero does not represent the absolute lowest value. • Example: Measurement of temperature in Fahrenheit scale, measurement of Sea levels 12
- 13. Scales of measurement • Ratio: The ratio scale of measurement is similar to the interval scale in that it also represents quantity and has equality of units. However, this scale also has an absolute zero (no numbers exist below the zero). A negative length is not possible. • Example: physical measures such as height and weight. • Often, the distinction between interval and ratio scales can be ignored in statistical analyses. • The distinctions between these two scales and the ordinal and nominal scales are more important.
- 14. Data • Data are observations of random variables made on the elements of a population or sample • Data are the quantities (numbers) or qualities (attributes) measured or observed that are to be collected and/or analyzed • The word data is plural, datum is singular • A collection of data is often called a data set (singular) 14
- 15. Data and information • Data is raw, unorganized facts that need to be processed. Data can be something simple and seemingly random and useless until it is organized. • Example: Each newborn’s birth weight • When data is processed, organized, structured or presented in a given context so as to make it useful, it is called information. • Example: Mean birth weight of newborns 15
- 16. Types of data 1. Nominal data • In statistics/biostatistics, we encounter many different types of data. • One of the simplest types of data is nominal data, in which the values fall into unordered categories or classes. Example: sex, marital status, ethnicity, religion, etc. • Numbers are often used to represent the categories. In a certain study, for instance, males might be assigned the value 1 and females the value 0
- 17. 2. Ordinal data • When the order among categories becomes important, the observations are referred to as ordinal data. • For example injuries may be classified according to their level of severity, so that 1= fatal, 2= severe, 3= moderate, and 4= minor. • Here a natural order exists among the groupings: a smaller number represents a more serious injury. However we are still not concerned with the magnitude of these numbers. 17
- 18. 3. Discrete data • For discrete data both ordering and magnitude are important. • In this case, the numbers represent actual measurable quantities or counts rather than mere labels. • Examples of discrete data include the number of car accidents in a given month, the number of times a woman has given birth. 18
- 19. 4. Continuous data • Data that represent measurable quantities but are not restricted to taking on certain specified values. • In this case the difference between any two possible data values can be arbitrarily small. • Examples of continuous data include time, the serum cholesterol level of a patient, etc. 19
- 20. Types and Methods of Data Collection • Statistical data may be classified into two categories depending upon the source: - Primary data: data collected by the investigator for the purpose of a specific inquiry or study. - Secondary data: data already collected by others that an investigator uses.
- 21. Data collection methods 1. Observation • It is a technique that involves systematically selecting, watching, and recording behaviors of people, measuring characteristics or other phenomena. • It includes all methods from simple visual observations to the use of high level machines. • Advantage: Gives relatively more accurate data on behavior and activities. • Disadvantages: Investigator’s or observer’s own bias, prejudice, desires may be reflected and needs more resources and skilled human power during the use of high level machines. 21
- 22. 2 . Self-administered Questionnaire & Interviews • These are the most commonly used research data collection techniques. • Self-administered questionnaire is – simpler and cheaper – can be administered to many persons simultaneously – can be sent by post (unlike interviews) • But requires a certain level of education and skill on the part of the respondents • People of a low socio-economic status are less likely to respond 22
- 23. 3. Face-to-face and telephone interviews – An interview is a conversation for gathering information. A research interview involves an interviewer, who coordinates the process of the conversation and asks questions, and an interviewee, who responds to those questions. – A good interviewer can stimulate and maintain the respondent’s interest, and can create a rapport (understanding) and atmosphere conducive to the answering of questions. – If anxiety is aroused, the interviewer can allay it. If a question is not understood, the interviewer can repeat and explain it.
- 24. 4. Mailed Questionnaire Method • The investigator prepares a questionnaire pertaining to the field of inquiry and sends it by post to the informants, together with a polite covering letter explaining the details, aims and objectives of collecting the information • The letter requests the respondents to cooperate by furnishing correct replies and returning the questionnaire duly filled in • Drawback: response rates tend to be relatively low, and there may be under-representation of less literate subjects
- 25. 5. Use of Documentary Sources • Includes clinical and other personal records, death certificates, published mortality statistics, census publications, etc. • Examples: - Official publications of CSA - Publication of MoH and other Ministries - Newspapers and Journals - International publications (WHO, UNICEF) - Records of Hospitals or any HI 25
- 26. 6. Computer Direct Interviews • These are interviews in which the Interviewees enter their own answers directly into a computer. • They can be used at malls, trade shows, offices, and so on. • The Survey System's optional Interviewing Module and Interview Stations can easily create computer-direct interviews. Some researchers set up a Web page survey for this purpose. 26
- 27. Advantages • The virtual elimination of data entry and editing costs • You will get more accurate answers to sensitive questions • Elimination of interviewer bias • Ensuring skip patterns are accurately followed • Response rates are usually higher 27
- 28. Disadvantages • The Interviewees must have access to a computer or one must be provided for them. • As with mail surveys, computer direct interviews may have serious response rate problems in populations of lower educational and literacy levels. This method may grow in importance as computer use increases. 28
- 29. Choosing Method of data collection • Decision Makers Need Information that is Relevant, Timely, Accurate and Useable 29
- 30. • The selection of the method of data collection is also based on practical considerations, such as: – The need for personnel, skills, equipment, etc. in relation to what is available, and the urgency with which results are needed. – The acceptability of the procedures to the subjects: the absence of inconvenience, unpleasantness, or untoward consequences. – The probability that the method will provide good coverage, i.e. will supply the required information about all or almost all members of the population or sample.
- 31. Choice of survey method will also depend on several factors. These include: • Speed: Email and Web page surveys are the fastest methods, followed by telephone interviewing. Mail surveys are the slowest. • Cost: Personal interviews are the most expensive, followed by telephone and then mail. Email and Web page surveys are the least expensive for large samples. • Computer and Internet usage: Web page and Email surveys offer significant advantages, but you may not be able to generalize their results to the population as a whole. • Literacy levels: Illiterate and less-educated people rarely respond to mail surveys. • Sensitive questions: People are more likely to answer sensitive questions when interviewed directly by a computer in one form or another.
- 32. Designing Questionnaire When designing a questionnaire the following points should be taken into account – Keep it (questions) short and simple (KISS) – Questions should be unambiguous and not double barreled – Use simple and direct language. The questions must be clearly understood by respondent. – The wording of a question should be simple and to the point. – The best kinds of questions are those which allow a pre-printed answer to be ticked 32
- 33. – Questions should be neither irrelevant nor too personal – Leading questions shouldn’t be asked. A “leading question” is one that suggests the answer. – The questionnaire should be designed so that the questions fall into a logical sequence. – After the questionnaire has been finalized, translate it into the local languages to be used for data collection – The last step in questionnaire design is to test the questionnaire with a small number of interviews before conducting the main interviews (a pilot).
- 34. General Considerations • To be successful, involve other experts and relevant decision-makers in the questionnaire design process • Formulate a plan for doing the statistical analysis during the design stage of the project • If you used one method in the past and need to compare results, stick to that method, unless there is a compelling reason to change
- 35. Types of questions Open-ended Questions: - Permit free responses that should be recorded in the respondent’s own words. They are used for: - facts with which the researcher is not very familiar, - opinions, attitudes, and suggestions of informants, or - sensitive issues
- 36. Closed Questions: Offer a list of possible options or answers from which the respondents must choose. Offer a list of options that are exhaustive and mutually exclusive, and Keep the number of options as few as possible. 36
- 37. Interviewing technique • Before the questionnaire is used for the data collection, it should be pre-tested • Manuals that explain each of the questions should be prepared – question-by-question specification • Enumerators and field supervisors should be trained before they are deployed to the field 37
- 38. • Enumerators should create a good communication environment with the respondents. • They should explain the questions in the questionnaire precisely to the respondent and should not lead the respondent. • There should be strong supervision of the field work until it is completed.
- 39. Rules for asking questions • Read Qs as they are written • Do not change order of Qs • Read the Qs slowly and clearly • Read Qs in a pleasant voice • Maintain eye contact which is culturally appropriate • Read the entire question to Respondent • Do not skip Qs • Verify information given by Respondent
- 40. Interviewing tactics for Sensitive Questions • Sensitive questions may offend the respondents because they may: – expose the respondent’s ignorance – call for a socially unacceptable answer – cause embarrassment
- 41. Possible tactics (Barton) – The everybody approach: "As you know, many people have been arrested for being involved in theft. Do you happen to have been arrested for being involved in theft?" – The other people approach: "Do you know anyone who has been arrested for theft? How about yourself?" – The Kinsey technique: stare firmly into the respondent’s eyes and ask, in simple, clear-cut language such as that to which the respondent is accustomed, and with an air of assuming that everybody has done everything, ‘Have you ever been arrested for being involved in theft?’
- 42. Informed consents Participation in a survey should be voluntary and a respondent can refuse to be interviewed or measured, etc. The information given should be simple and clear and adapted to the respondent’s level of understanding. Informed consents can be either signed or verbal 48
- 43. The interviewer is responsible for: – explaining what the survey is about, – providing all the necessary information, and – making sure the respondent understands the implications of his/her participation before giving his/her consent.
- 44. • Consents must be documented by asking the respondents to sign an Informed Consent Form or give verbal consent before doing the interview. – These forms must mention: • who will be doing the study, • the types of questions that will be asked, • why the study is being done, and • who will have access to the information provided. 50
- 45. Module 1.2: Methods of data processing, organization and presentation 51
- 46.

  | No. | Ht | Wt | Sex | age | FEV | No. | Ht | Wt | Sex | age | FEV |
  |---|---|---|---|---|---|---|---|---|---|---|---|
  | 1 | 175.2 | 79.2 | 1 | 57 | 3.80 | 16 | 177.5 | 69.7 | 1 | 32 | 4.10 |
  | 2 | 164.5 | 92.4 | 6 | 60 | 3.50 | 17 | 164.0 | 719 | 2 | 58 | 3.15 |
  | 3 | 168.5 | 64.6 | 1 | 62 | 1.48 | 18 | 174.0 | 63.2 | 1 | 45 | 4.25 |
  | 4 | 180.0 | 82.6 | 1 | 43 | 4.35 | 19 | 161.0 | 60.0 | 2 | 59 | 2.75 |
  | 5 | 156.0 | 79.9 | 2 | 13 | 2.70 | 20 | 169.5 | 63.3 | 3 | 53 | 3.32 |
  | 6 | 170.0 | 80.9 | 1 | 61 | 2.35 | 21 | 181.5 | 101.3 | 1 | 37 | 4.20 |
  | 7 | 170.0 | 79.7 | 1 | 67 | 149 | 22 | 173.0 | 72.9 | 1 | 47 | 4.45 |
  | 8 | 162.0 | 57.4 | 1 | 63 | 2.95 | 23 | 473.6 | 55.9 | 2 | 39 | 3.65 |
  | 9 | 177.0 | 98.1 | 1 | 46 | 4.20 | 24 | 178.2 | 39.2 | 1 | 70 | 3.05 |
  | 10 | 285.0 | 61.6 | 2 | 47 | 2.45 | 25 | 159.0 | 63.5 | 2 | 42 | 3.20 |
  | 11 | 156.0 | 60.0 | 2 | 43 | 2.10 | 26 | 149.0 | 69.2 | 2 | 58 | 29.3 |
  | 12 | 157.0 | 62.0 | 3 | 34 | 3.41 | 27 | 159.0 | 80.3 | 2 | 63 | 2.45 |
  | 13 | 150.0 | 51.8 | 2 | 49 | 2.70 | 28 | 190.0 | 883.0 | 1 | 60 | 4.65 |
  | 14 | 154.0 | 58.1 | 2 | 47 | 2.45 | 29 | 175.0 | 85.0 | 7 | 41 | 3.75 |
  | 15 | 165.0 | 70.6 | 1 | 79 | 3.10 | 30 | 168.7 | 855 | 1 | 60 | 3.15 |

- 47. Data cleaning and editing • When the questionnaires are collected from the field, they should be coded and edited • Checks are basically of two sorts: range checks and consistency checks. Range checks exclude, for example, the erroneous occurrence of code 3 for sex, which should only be code 1 (male) or code 2 (female). Consistency checks detect impossible combinations of data
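A range check of the kind described above can be sketched in a few lines of Python. This is only an illustration: the five records are copied from the uncleaned data table, while the plausible ranges in `RANGES` are assumptions chosen for this sketch and would normally come from the study protocol.

```python
# Records as (No., Ht in cm, Wt in kg, Sex code, age in years, FEV in litres),
# copied from a few rows of the uncleaned table.
records = [
    (1, 175.2, 79.2, 1, 57, 3.80),
    (2, 164.5, 92.4, 6, 60, 3.50),
    (7, 170.0, 79.7, 1, 67, 149),
    (10, 285.0, 61.6, 2, 47, 2.45),
    (28, 190.0, 883.0, 1, 60, 4.65),
]

# Assumed plausible ranges for adults (hypothetical limits for illustration).
RANGES = {"Ht": (100.0, 220.0), "Wt": (30.0, 200.0), "age": (15, 100), "FEV": (0.3, 8.0)}
VALID_SEX = {1, 2}  # 1 = male, 2 = female


def range_errors(record):
    """Return the names of the fields that fail a range check."""
    _, ht, wt, sex, age, fev = record
    errors = []
    if sex not in VALID_SEX:
        errors.append("Sex")
    for name, value in (("Ht", ht), ("Wt", wt), ("age", age), ("FEV", fev)):
        low, high = RANGES[name]
        if not low <= value <= high:
            errors.append(name)
    return errors


# Map record number -> list of failing fields, keeping only records with errors.
flagged = {rec[0]: range_errors(rec) for rec in records if range_errors(rec)}
for number, fields in flagged.items():
    print(f"Record {number}: check {', '.join(fields)}")
```

Flagged records would then be compared with the original questionnaires rather than corrected by guesswork. A consistency check would follow the same pattern but test combinations of fields instead of single values.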
- 48. Basic precautions recommended to minimize errors during the handling of data: • Avoid any unnecessary copying of data from one form to another • Use a verification procedure during data entry - range and skip rules, double data entry, etc. • Check all calculations carefully, example – date conversion, units of measurement, etc. 54
- 49. Data organization: Tables • The use of tables for presenting data involves grouping the data into mutually exclusive categories of the variable and counting the number of occurrences in each category • Tables should be as simple as possible and self-explanatory • Numerical entries of zero should be explicitly written rather than indicated by a dash • Totals should be shown either in the top row and the first column or in the last row and last column • If data are not original, their source should be given in a footnote
- 50. Asthma versus sex and smoking

  | Sex and smoking status | No asthma, n | No asthma, % | Asthma, n | Asthma, % | Total |
  |---|---|---|---|---|---|
  | **Sex** | | | | | |
  | Female | 459 | 91.6 | 42 | 8.4 | 501 |
  | Male | 439 | 93.0 | 33 | 7.0 | 472 |
  | Total | 898 | 92.3 | 75 | 7.7 | 973 |
  | **Smoking** | | | | | |
  | Never smoker | 480 | 91.4 | 45 | 8.6 | 525 |
  | Ex-smoker | 254 | 91.7 | 23 | 8.3 | 277 |
  | Current smoker | 164 | 95.9 | 7 | 4.1 | 171 |
  | Total | 898 | 92.3 | 75 | 7.7 | 973 |
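The percentages in the asthma table are row percentages: each count divided by its row total. A minimal Python sketch, using the counts by sex from the table (the function name `row_percentages` is made up for this example):

```python
# Counts of (no asthma, asthma) by sex, taken from the table.
counts = {"Female": (459, 42), "Male": (439, 33)}


def row_percentages(no, yes):
    """Return (% without asthma, % with asthma, row total), to one decimal place."""
    total = no + yes
    return round(100 * no / total, 1), round(100 * yes / total, 1), total


table = {sex: row_percentages(no, yes) for sex, (no, yes) in counts.items()}
for sex, (pct_no, pct_yes, total) in table.items():
    print(f"{sex}: No {pct_no}%, Yes {pct_yes}%, Total {total}")
```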
- 51. Data presentation: Diagrams • Allows readers to obtain an overall grasp of the data presented. • The relationship can be seen more quickly and easily from a graph than from a table. • The choice of one graph over the other depends on personal choices and/or the type of the data. Bar chart and pie chart are commonly used for quantitative discrete or qualitative data Histograms, frequency polygon, and line graphs are used for quantitative continuous data 57
- 52. Component Bar graph - Smoking status and presence of asthma [Figure: x-axis: smoking status (never smoker, ex-smoker, current smoker); y-axis: number of individuals; legend: asthma No / Yes]
- 53. Pie-chart – smoking status (%) Never smoker 54% Ex-smoker 28% Current smoker 18% 59
- 54. Histogram for FEV1 data 60
- 55. Neonatal Mortality Rate by Sex [Figure: x-axis: surveillance year, 2005 to 2011; y-axis: NNMR per 1000 LB; legend: Female / Male. Data labels in extraction order: 65.8, 34.2, 37.2, 46.3, 25.8, 29.0, 29.3, 50.2, 44.8, 49.0, 54.6, 41.4, 38.7, 34.3]
- 56. General rules for constructing graphs • Every graph should be self-explanatory and as simple as possible • Titles are usually placed below the graph • Legends or keys should be used to differentiate variables if more than one is shown • The axes label should be placed to read from the left side and from the bottom • The units into which the scale is divided should be clearly indicated • The numerical scale representing frequency must start at zero or a break in the line should be shown 62
- 57. Module 1.3: Data summarization 63
- 58. Data Exploration • The exploration procedure produces summary statistics and graphical displays • The reasons for using the explore procedure are: – data screening, – outlier identification, – description, – assumption checking, and – characterizing differences among subpopulations (groups of cases). 64
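- 59.

  | No. | Ht | Wt | Sex | age | FEV | No. | Ht | Wt | Sex | age | FEV |
  |---|---|---|---|---|---|---|---|---|---|---|---|
  | 1 | 175.2 | 79.2 | 1 | 57 | 3.80 | 16 | 177.5 | 69.7 | 1 | 32 | 4.10 |
  | 2 | 164.5 | 92.4 | 1 | 60 | 3.50 | 17 | 164.0 | 71.9 | 2 | 58 | 3.15 |
  | 3 | 168.5 | 64.6 | 1 | 62 | 1.48 | 18 | 174.0 | 63.2 | 1 | 45 | 4.25 |
  | 4 | 180.0 | 82.6 | 1 | 43 | 4.35 | 19 | 161.0 | 60.0 | 2 | 59 | 2.75 |
  | 5 | 156.0 | 79.9 | 2 | 47 | 2.70 | 20 | 169.5 | 63.3 | 2 | 53 | 3.32 |
  | 6 | 170.0 | 80.9 | 1 | 61 | 2.35 | 21 | 181.5 | 101.3 | 1 | 37 | 4.20 |
  | 7 | 170.0 | 79.7 | 1 | 67 | 0.80 | 22 | 173.0 | 72.9 | 1 | 47 | 4.45 |
  | 8 | 162.0 | 57.4 | 1 | 63 | 2.95 | 23 | 164.2 | 55.9 | 2 | 39 | 3.65 |
  | 9 | 177.0 | 98.1 | 1 | 46 | 4.20 | 24 | 178.2 | 93.2 | 1 | 70 | 3.05 |
  | 10 | 160.5 | 61.6 | 2 | 47 | 2.45 | 25 | 159.0 | 63.5 | 2 | 42 | 3.20 |
  | 11 | 156.0 | 60.0 | 2 | 43 | 2.10 | 26 | 149.0 | 69.2 | 2 | 58 | 2.20 |
  | 12 | 157.0 | 62.0 | 2 | 34 | 3.41 | 27 | 159.0 | 80.3 | 2 | 63 | 2.45 |
  | 13 | 150.0 | 51.8 | 2 | 49 | 2.70 | 28 | 190.0 | 88.3 | 1 | 60 | 4.65 |
  | 14 | 154.0 | 58.1 | 2 | 47 | 2.45 | 29 | 175.0 | 85.0 | 1 | 41 | 3.75 |
  | 15 | 165.0 | 70.6 | 1 | 79 | 3.10 | 30 | 168.7 | 85.5 | 1 | 60 | 3.15 |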
- 60. • Data screening may show that you have unusual values, extreme values, gaps in the data, or other peculiarities. • Exploring the data can help to determine whether the statistical techniques that you are considering for data analysis are appropriate. • The exploration may indicate that you need to transform the data if the technique requires some known distribution, say the Normal distribution. 66
- 61. Measures of Central tendency - The arithmetic mean, median and mode - The arithmetic mean is unique, takes into account all data points, and lends itself to further manipulation, but it is sensitive to extreme values - The median is unique, does not depend on the magnitude of every data point, and is not affected by extreme values - The mode might not exist or might not be unique; it can be determined for qualitative data
- 62. Exercise • Calculate the mean, median and mode for the whole sample and sex-specific summary values using the data in the table below • Sex: 1 = Male, 2 = Female • Height is measured in cm, weight in kg, age in years and FEV in litres
- 63.

  | Ht | Wt | Sex | age | FEV |
  |---|---|---|---|---|
  | 175.2 | 79.2 | 1 | 57 | 3.80 |
  | 164.5 | 92.4 | 1 | 60 | 3.50 |
  | 168.5 | 64.6 | 1 | 62 | 1.48 |
  | 180.0 | 82.6 | 1 | 43 | 4.35 |
  | 156.0 | 79.9 | 2 | 47 | 2.70 |
  | 170.0 | 80.9 | 1 | 61 | 2.35 |
  | 170.0 | 79.7 | 1 | 67 | 0.80 |
  | 162.0 | 57.4 | 1 | 63 | 2.95 |
  | 177.0 | 98.1 | 1 | 46 | 4.20 |
  | 160.5 | 61.6 | 2 | 47 | 2.45 |
  | 156.0 | 60.0 | 2 | 43 | 2.10 |
  | 157.0 | 62.0 | 2 | 34 | 3.41 |
  | 150.0 | 51.8 | 2 | 49 | 2.70 |
  | 154.0 | 58.1 | 2 | 47 | 2.45 |
  | 165.0 | 70.6 | 1 | 79 | 3.10 |

- 64. Summary values

  | Sex | Statistic | Age | Ht | Wt | FEV |
  |---|---|---|---|---|---|
  | Male | Mean | 54.85 | 173.54 | 80.27 | 3.42 |
  | | Median | 59.94 | 174.00 | 80.90 | 3.75 |
  | | Mode | 32.47 | 170.00 | 57.40 | 4.20 |
  | | Sum | 932.47 | 2950.10 | 1364.60 | 58.13 |
  | | n | 17 | 17 | 17 | 17 |
  | Female | Mean | 49.16 | 158.40 | 64.42 | 2.81 |
  | | Median | 47.40 | 159.00 | 62.00 | 2.70 |
  | | Mode | 34.43 | 156.00 | 60.00 | 2.45 |
  | | Sum | 639.04 | 2059.20 | 837.50 | 36.53 |
  | | n | 13 | 13 | 13 | 13 |
  | Both | Mean | 52.38 | 166.98 | 73.40 | 3.16 |
  | | Median | 50.96 | 166.75 | 71.25 | 3.15 |
  | | Mode | 32.47 | 156.00 | 60.00 | 2.45 |
  | | Sum | 1571.51 | 5009.30 | 2202.10 | 94.66 |
  | | n | 30 | 30 | 30 | 30 |
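One way to work the exercise for a single variable is sketched below in Python, using only the standard `statistics` module. It covers the FEV column of the 15-subject exercise table; the helper name `summarize` is made up for this example, and the other columns could be handled in the same way.

```python
from statistics import mean, median, multimode

# (Sex, FEV in litres) for the 15 subjects in the exercise table.
# Sex: 1 = male, 2 = female.
rows = [
    (1, 3.80), (1, 3.50), (1, 1.48), (1, 4.35), (2, 2.70),
    (1, 2.35), (1, 0.80), (1, 2.95), (1, 4.20), (2, 2.45),
    (2, 2.10), (2, 3.41), (2, 2.70), (2, 2.45), (1, 3.10),
]


def summarize(values):
    """Return n, mean (2 d.p.), median, and all modes of a list of numbers."""
    return {
        "n": len(values),
        "mean": round(mean(values), 2),
        "median": median(values),
        "modes": multimode(values),  # every value tied for the highest frequency
    }


fev_all = [fev for _, fev in rows]
fev_male = [fev for sex, fev in rows if sex == 1]
fev_female = [fev for sex, fev in rows if sex == 2]

summary = {
    "both": summarize(fev_all),
    "male": summarize(fev_male),
    "female": summarize(fev_female),
}
for group, stats in summary.items():
    print(group, stats)
```

In this subset two FEV values (2.45 and 2.70) are tied as modes for the whole sample, and no value repeats among the males, so `multimode` returns every male value. Both cases illustrate that the mode may not be unique or may not exist in a useful sense.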
- 65. Measures of Variation/Dispersion • Dispersion of a set of observations refers to the scatter of the observations around a measure of central tendency • Commonly used measures of variation: range, percentiles, and standard deviation. Of these, only the standard deviation directly assesses the scatter of observations around the mean
- 66. The Coefficient of Variation To compare the variability of two or more sets of data for same or different variables, standard deviations may lead to fallacious results. • The variables involved might be measured in different units, or different characteristics • Coefficient of Variation (CV) is the standard deviation expressed as a percentage of the mean. 72
- 67. Use the above data to determine standard deviation and Coefficient of variation

  | Sex | Statistic | Age | Ht | Wt | FEV |
  |---|---|---|---|---|---|
  | Male | Mean | 54.85 | 173.54 | 80.27 | 3.42 |
  | | Variance | 160.7 | 49.53 | 157.22 | 1.15 |
  | | Std dev | 12.68 | 7.04 | 12.54 | 1.07 |
  | | CV | 23.1 | 4.1 | 15.6 | 31.3 |
  | | Range | 46.06 | 28 | 43.9 | 3.85 |
  | Female | Mean | 49.16 | 158.4 | 64.42 | 2.81 |
  | | Variance | 74.16 | 32.65 | 74.78 | 0.24 |
  | | Std dev | 8.61 | 5.71 | 8.65 | 0.49 |
  | | CV | 17.5 | 3.6 | 13.4 | 17.4 |
  | | Range | 28.98 | 20.5 | 28.5 | 1.55 |
  | Both | Mean | 52.38 | 166.98 | 73.40 | 3.16 |
  | | Variance | 127.58 | 99.03 | 181.48 | 0.83 |
  | | Std dev | 11.3 | 9.95 | 13.47 | 0.91 |
  | | CV | 21.6 | 6.0 | 18.4 | 28.8 |
  | | Range | 46.06 | 41 | 49.5 | 3.85 |
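The coefficient of variation is the standard deviation divided by the mean, multiplied by 100. A short Python sketch reproduces two of the male CV values from the means and standard deviations in the summary table (the function name `coefficient_of_variation` is made up for this example):

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * std_dev / mean


# Male summary values taken from the table: (mean, standard deviation).
male_age = (54.85, 12.68)
male_height = (173.54, 7.04)

cv_age = round(coefficient_of_variation(male_age[1], male_age[0]), 1)
cv_height = round(coefficient_of_variation(male_height[1], male_height[0]), 1)
print(f"CV of age: {cv_age}%  CV of height: {cv_height}%")
```

Because the CV has no units, it allows a statement such as "age varies more, relative to its mean, than height does", which could not be made by comparing a standard deviation in years with one in centimetres.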
- 68. Data transformations • The assumptions underlying a statistical method may not always be satisfied by a particular set of data. • For example, a distribution may be positively skewed rather than normal. Such problems can often be overcome simply by transforming the data to a different scale of measurement • The most common choice is the logarithmic transformation 74
- 69. Logarithmic transformation • When a logarithmic transformation is applied to a variable, each individual value is replaced by its logarithm. y = log x • Where x is the original value and y the transformed value. • The logarithm has the effect both of equalizing the standard deviations and removing skewness (absence of symmetry) 75
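The effect of the logarithmic transformation on skewness can be illustrated with a small Python sketch. The data below are made up for the illustration (a positively skewed sample), and `skewness` is a simple moment-based coefficient defined here rather than a library function.

```python
import math


def skewness(values):
    """Moment coefficient of skewness: third central moment / (variance ** 1.5)."""
    n = len(values)
    centre = sum(values) / n
    m2 = sum((v - centre) ** 2 for v in values) / n
    m3 = sum((v - centre) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical positively skewed measurements (all values must be > 0 to take logs).
x = [1, 2, 2, 3, 3, 4, 5, 8, 15, 40]
y = [math.log(v) for v in x]  # y = log x, natural logarithm

print(f"skewness before: {skewness(x):.2f}")
print(f"skewness after:  {skewness(y):.2f}")
```

For these values the skewness of the log-transformed data is clearly smaller than that of the original data, although it is not exactly zero. Whether the transformed data are close enough to Normal still has to be checked, for example with a histogram.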
- 70. Choice of a transformation • There are alternative transformations • Reciprocal transformation: $y = 1/x$. It is stronger than the logarithmic transformation and would be appropriate if the distribution were considerably more positively skewed than lognormal.
- 71. • Square root transformation: $y = \sqrt{x}$. It is used when the constant variance assumption does not hold true. • It is weaker than the logarithmic transformation. • Negative skewness can be removed by using a power transformation, such as a square or a cubic transformation; the strength increases with the order of the power.
- 72. Histogram & Normal curve with transformations 78
- 73. Module 2: Probability and Probability Distributions 79
- 74. Probability Distributions • Definition: A random variable is a numerical quantity that takes different values with specified probabilities. • There are two types of random variables: discrete and continuous. • Definition: A random variable for which there exists a discrete set of values with specified probabilities is a discrete random variable.
- 75. Probability Distributions • Example: Diarrhoea is one of the most frequent reasons for visiting health institutions in the first 2 years of life in children. • Let X be the random variable that represents the number of episodes of diarrhoea in the first 2 years of life. Then X is a discrete random variable, which takes on values 0,1,2, .... • Definition: A random variable whose values form a continuum (i.e., have no gaps) such that ranges of values occur with specified probabilities is a continuous random variable. 81
- 76. Probability Mass Function for a Discrete Random Variable • The values taken by a discrete random variable and its associated probabilities can be expressed by a rule, or relationship, that is called a probability mass function (pmf). • Definition: A pmf is a mathematical relationship, or rule, that assigns to any possible value of a discrete random variable X the probability P(X = r). This assignment is made for all values r that have positive probability. The pmf is also referred to as the probability distribution.
- 77. General rules which apply to any probability distribution 1. Since the values of a probability distribution are probabilities, they must be numbers in the interval from 0 to 1. 2. Since a random variable has to take on one of its values, the sum of all the values of a probability distribution must be equal to 1. • Example: Check whether the following function can serve as the probability distribution of an appropriate random variable 83
- 78. General rules … $f(x) = \dfrac{x+2}{12}$ for x = 1, 2, and 3. Substituting the values of x, f(1) = 3/12, f(2) = 4/12, and f(3) = 5/12. Since none of these values is negative or greater than one, and since their sum is 3/12 + 4/12 + 5/12 = 1, the given function is a probability distribution.
- 79. Example on Hypertension-control: • Suppose a physician agrees to use a new anti-hypertensive drug on a trial basis on the first 4 untreated hypertensives whom she encounters in her practice before deciding whether to adopt the drug for routine use. • Let X = the number of patients out of 4 who are brought under control. Suppose that from previous experience with the drug, for any clinical practice, the drug company expects the following probabilities.

  | r | 0 | 1 | 2 | 3 | 4 |
  |---|---|---|---|---|---|
  | P(X=r) | .008 | .076 | .265 | .411 | .240 |

- 80. Example: • For the above table, for any clinical practice, the probability that between 0 and 4 hypertensives are brought under control = 1, i.e., • 0.008 + 0.076 + 0.265 + 0.411 + 0.240 = 1 • What is the probability that: – At least two patients are brought under control? – At most three patients are brought under control?
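The two rules can be checked mechanically. The Python sketch below evaluates $f(x) = (x+2)/12$ with exact fractions; the helper name `is_probability_distribution` is made up for this example.

```python
from fractions import Fraction


def f(x):
    """Candidate probability function f(x) = (x + 2) / 12."""
    return Fraction(x + 2, 12)


def is_probability_distribution(values):
    """Rule 1: every value lies in [0, 1]. Rule 2: the values sum to exactly 1."""
    return all(0 <= v <= 1 for v in values) and sum(values) == 1


probabilities = [f(x) for x in (1, 2, 3)]
print(probabilities, is_probability_distribution(probabilities))
```

Using `Fraction` avoids floating-point rounding, so the test that the sum equals 1 is exact.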
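The two questions can be answered by adding the relevant entries of the probability table for the hypertension-control example. A minimal Python sketch:

```python
# P(X = r) for r = 0..4, from the hypertension-control table.
pmf = {0: 0.008, 1: 0.076, 2: 0.265, 3: 0.411, 4: 0.240}

# At least two patients under control: P(X >= 2) = P(2) + P(3) + P(4).
p_at_least_two = sum(p for r, p in pmf.items() if r >= 2)

# At most three patients under control: P(X <= 3) = P(0) + P(1) + P(2) + P(3).
p_at_most_three = sum(p for r, p in pmf.items() if r <= 3)

print(f"P(X >= 2) = {p_at_least_two:.3f}")
print(f"P(X <= 3) = {p_at_most_three:.3f}")
```

The second answer can also be obtained as the complement 1 - P(X = 4) = 1 - 0.240.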
- 81. 1. Binomial distribution • The Binomial distribution with parameters n and p is a discrete probability distribution of the number of successes in a sequence of n independent binary (yes/no) experiments, each of which yields success with probability p. • A useful summary measure, used to describe binary variables, is the proportion with which the variable took one of its values, called success. • The binomial distribution is used to model the number of successes in a sample of size n drawn with replacement from a population of size N. 87
- 82. The Binomial Distribution • Definition: The distribution of the number of successes (r) in n statistically independent trials, where the probability of success on each trial is p, is known as the binomial distribution, and has a probability mass function given by: $P(X = r) = \binom{n}{r} p^{r} (1-p)^{n-r}$, r = 0, 1, 2, …, n, where $\binom{n}{r} = \dfrac{n!}{r!\,(n-r)!}$ • The mean is np and the variance is np(1-p)
- 83. Probability mass function for the binomial distribution 89
- 84. Example: • What is the probability of obtaining 2 boys out of 5 children if the probability of a boy is 0.51 at each birth and the sexes of successive children are considered independent random variables? • n=5, p=0.51, 1-p=0.49 and r=2: $P(X = 2) = \binom{5}{2}(0.51)^{2}(0.49)^{3} = \dfrac{5!}{2!\,3!}(0.51)^{2}(0.49)^{3} = 0.306$
- 85. Continuous Probability Distribution • A continuous probability distribution is a smooth density curve that models the distribution of a continuous random variable. • The area under the curve is 1, and the area within any interval is the probability that the value of the random variable is in that interval. • A density function is a formula used to represent the distribution of a continuous random variable.
- 86. Definition • The probability distribution of a continuous random variable is described by a nonnegative function f(x), the probability density function, such that: – The total area bounded by its curve and the x-axis is equal to one – The subarea under the curve bounded by the curve, the x-axis, and the perpendiculars erected at any two points a and b gives the probability that x is between a and b
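The calculation in this example can be reproduced with a short Python sketch. The function name `binomial_pmf` is made up here; `math.comb` from the standard library supplies the binomial coefficient.

```python
from math import comb


def binomial_pmf(r, n, p):
    """P(X = r) for a binomial distribution with n trials and success probability p."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)


n, p = 5, 0.51
prob_two_boys = binomial_pmf(2, n, p)
print(f"P(X = 2) = {prob_two_boys:.3f}")

# The mean and variance follow from np and np(1 - p).
mean = n * p
variance = n * p * (1 - p)
print(f"mean = {mean:.2f}, variance = {variance:.4f}")
```

Summing `binomial_pmf(r, n, p)` over r = 0, …, 5 gives 1, which is a quick check that the function behaves as a probability distribution.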
- 87. 2. Normal distribution • The Normal distribution, also called the Gaussian distribution, is the most important distribution in all of statistics. • The normal density is given by: $f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$, for $-\infty < x < \infty$, where π = 3.141… and e = 2.72…
- 88. Characteristics 1. It is symmetrical about its mean 2. Mean, median and mode are equal 3. The total area under the curve above the x-axis is one square unit 4. The interval within one SD of the mean in both directions contains approximately 68% of the area 5. The maximum height of the curve, at x = μ, is $1/(\sigma\sqrt{2\pi})$ 6. The normal distribution is determined by two parameters, the mean and the standard deviation.
- 89. The Normal Distribution curve [Figure: normal density curve annotated with μ = μx and σ = σx]
- 91. The standard Normal distribution • Definition: A normal distribution with mean 0 and variance 1 is referred to as a standard, or unit, normal distribution. This distribution is denoted by N(0,1), with density $f(z) = \dfrac{1}{\sqrt{2\pi}}\, e^{-z^{2}/2}$ for $-\infty < z < +\infty$ • This distribution is symmetrical about 0 (the mean), since f(z) = f(−z). • About 68% of the area under the normal density lies between −1 and +1, about 95% lies between −2 and +2, and about 99% lies between −2.5 and +2.5
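
The areas quoted for the standard normal curve can be verified numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `central_area` is an illustrative assumption.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)


def central_area(k: float) -> float:
    """Area under the N(0, 1) density between -k and +k."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1.0, 2.0, 2.5):
    # Prints the proportion of the area within k standard deviations of 0
    print(k, round(central_area(k), 4))
```

The printed proportions are close to the rounded 68%, 95% and 99% figures used on the slide.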
- 92. Application of Normal distribution • Example: Suppose it is known that the heights of a population of individuals are approximately normally distributed with a mean of 70 inches and a standard deviation of 3 inches. What is the probability that a person picked at random from this group will be a) between 65 and 74 inches tall? b) taller than 75 inches? c) shorter than 65 inches?
- 93. Solution Step 1: Transform to the standard normal distribution using $z = \dfrac{x - \mu}{\sigma}$. Step 2: Determine the area bounded by the curve, the x-axis and the two points, P(a < z < b). Step 3: Look up the z distribution table for the corresponding values of z.
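
The three steps can be sketched in Python for the height example (mean 70 inches, SD 3 inches). Here `statistics.NormalDist` stands in for the z table; the variable names are illustrative.

```python
from statistics import NormalDist

MU, SIGMA = 70.0, 3.0
std_normal = NormalDist()  # N(0, 1), used in place of the z table


def z_score(x: float) -> float:
    """Step 1: transform a height x to the standard normal scale."""
    return (x - MU) / SIGMA


# Steps 2 and 3: areas under the standard normal curve
p_between_65_74 = std_normal.cdf(z_score(74)) - std_normal.cdf(z_score(65))
p_above_75 = 1 - std_normal.cdf(z_score(75))
p_below_65 = std_normal.cdf(z_score(65))

print(round(p_between_65_74, 3), round(p_above_75, 3), round(p_below_65, 3))
```

Because 65 and 75 are equally far from the mean, parts (b) and (c) give the same probability.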
- 94. 3. The t-distribution • The t-distribution is a family of continuous probability distributions that arise when estimating the mean of a normally distributed population in situations where the sample size is small and population standard deviation is unknown. • Whereas a normal distribution describes a full population, t-distributions describe samples drawn from a full population; accordingly, the t-distribution for each sample size is different. 100
- 95. The t-distribution • The t-distribution is similar in shape to the Normal distribution but is more spread out, with longer tails than the standard Normal. • It is symmetrical about zero, its mean. With k = df degrees of freedom, the variance is $\sigma^{2} = k/(k-2)$ for k > 2; the mean does not exist for k = 1 and the variance does not exist for k = 1, 2 • The df increases with the sample size. As the sample size increases, the shape of the t-distribution becomes increasingly like the standard Normal distribution. • It is used for estimation of means.
- 97. The t-distribution ν = n−1 degrees of freedom 103
- 98. Module 3.1: Sampling methods and Sample size estimation 104
- 99. Why sample? • It is usually not cost effective or practicable to collect and examine all the data that might be available. • Instead it is often necessary to draw a sample of information from the whole population to enable the detailed examination required to take place. • Sampling provides a means of gaining information about the population without the need to examine the population in its entirety. 105
- 100. • Purposes of sampling: Provides various types of statistical information of a qualitative or quantitative nature about the whole by examining a few selected units. • Advantages of sample based studies – Cost effectiveness – Timeliness – Inaccessibility of some people – Less destructive in data summarization – Accuracy 106
- 101. Caveats • Sampling can provide a valid, defensible methodology but it is important to match the type of sample needed to the type of analysis required. • The auditor should also take care to check the quality of the information from which the sample is to be drawn. If the quality is poor, sampling may not be justified. 107
- 102. Sampling Designs • Sample design covers the method of selection, the sample structure and plans for analysing and interpreting the results. • Sample designs can vary from simple to complex and depend on the type of information required and the way the sample is selected. • The design will impact upon the size of the sample and the way in which analysis is carried out. In simple terms the tighter the required precision and the more complex the design the larger the sample size. 108
- 103. Sampling Designs • The design may make use of the characteristics of the population, but it does not have to be proportionally representative. • It may be necessary to draw a larger sample than would be expected from some parts of the population; • For example, to select more from a minority grouping to ensure that we get sufficient data for analysis on such groups. 109
- 104. Sampling Designs • The aim of the design is to achieve a balance between the required precision and the available resources. 110
- 105. Definition of terms • Sample – Subset of the population of interest • Sampling – process of selecting units from the population of interest so that, by studying the sample, we can generalize our results back to the population.
- 106. • Population - Finite or infinite set of objects whose properties are to be studied. • Study population/sample population – subset of target population chosen so as to be representative of the total population • Sampling unit - unit of selection in the sampling process. • Study unit – subject on which information is collected. 112
- 107. Conditions that need to be met • The sample must be well chosen – Representative: the method of choosing the sample matters, and the best methods involve the planned introduction of chance. A sampling procedure should be fair, selecting people for inclusion in the sample in an impartial way, so as to get a representative cross section of the public – No selection bias: when a selection procedure is biased, taking a large sample does not help; it just repeats the basic mistake on a larger scale
- 108. Conditions … • A sample chosen in a haphazard fashion, or because it is ‘handy’, is unlikely to be representative. Such samples may be used in exploratory surveys to get a ‘feel’ for the situation • The sample must be sufficiently large – Sample size • There must be adequate coverage of the sample – Response rate: non-respondents can be very different from respondents. When the non-response rate is high, look out for non-response bias.
- 109. Is a sample any good? Some samples are really bad. To find out whether a sample is any good, ask: 1. How was it chosen? 2. Was there selection bias? 3. Was there non-response bias? These questions cannot always be answered just by looking at the data
- 110. Sampling techniques/methods • Sampling is the process of selecting a number of study units from a defined study population. • Clearly define study population and study unit – Study population – individuals, households, institutions, records, etc… – Study units – an individual, a household, an institution or a record 116
- 111. Sampling cont… • Types: probability and non-probability – Probability – quantitative studies – Non-probability – qualitative studies • Probability sampling technique: – Involves using random selection procedures to ensure that each unit of the sample is chosen on the basis of chance. – All units of the study population should have an equal, or at least a known non-zero chance of being included in the sample. – Sample drawn in such a way that it is representative of the population – The type to be used depends on population composition and availability of sampling frame 117
- 112. Sampling cont… Probability sampling methods include: – Simple random sampling – Systematic sampling – Stratified sampling – Cluster sampling – Multistage sampling 118
- 113. 1. Simple random sampling • Selecting required number of sampling units randomly from list of all units – Up-to-date Sampling frame – Random selection – manually using table of random numbers or using computer programs • E.g. 250 households from list of 9000 households • Better representativeness but costly and representativeness reduced in heterogeneous population 119
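
As an illustration of computer-based random selection, the sketch below draws 250 households from a sampling frame of 9000 using Python's standard `random` module. The household IDs 1 to 9000 and the fixed seed are assumptions made only so the example is reproducible.

```python
import random

# Hypothetical sampling frame: households numbered 1..9000
household_ids = range(1, 9001)

rng = random.Random(2024)  # fixed seed only for reproducibility
sample_ids = rng.sample(household_ids, k=250)  # selection without replacement

print(len(sample_ids), sorted(sample_ids)[:5])
```

Every household in the frame has the same chance (250/9000) of being selected.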
- 114. 2. Systematic sampling • Sampling units are selected at regular intervals. The starting unit is selected randomly • Example: to select a sample of 100 students from 2500, first calculate the sampling interval = 2500/100 = 25. Then randomly select the first student from among the first 25, and finally pick every 25th student after that • Easier and less time consuming • Can be done without a sampling frame – sequential studies • Risk of bias if there is cyclic repetition in the list
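
The student example can be sketched as follows. The function name `systematic_sample` and the assumption that students are numbered 1 to 2500 are illustrative.

```python
import random


def systematic_sample(population_size: int, sample_size: int, rng: random.Random) -> list[int]:
    """Select every k-th unit after a random start within the first interval."""
    interval = population_size // sample_size  # sampling interval k
    start = rng.randint(1, interval)           # random start in 1..k
    return [start + i * interval for i in range(sample_size)]


rng = random.Random(7)  # fixed seed only for reproducibility
selected = systematic_sample(2500, 100, rng)
print(selected[:4])
```

With an interval of 25, the last selected student is always numbered between 2476 and 2500, so the sample spans the whole list.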
- 115. 3. Stratified sampling • Used when the population consists of distinct subgroups/strata • Ensures that the proportions of individuals with certain characteristics in the sample will be the same as those in the whole population – Representation of groups with different characteristics • The study population must be divided into strata of the characteristic (Example: residence, age, sex, profession) and then random or systematic samples are obtained from each stratum
- 116. 3. Stratified sampling cont. • Depending on the need, samples from each stratum can be drawn either proportional to their size or non-proportionally/equal size from each stratum – Proportional – using the same sampling fraction (n/N) in every stratum – Equal size – to represent small groups • Improved representativeness • Estimates can be obtained for each stratum and the population
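
Proportional allocation applies the same sampling fraction n/N to every stratum. The sketch below assumes two hypothetical strata (6000 urban and 3000 rural households) and a total sample of 300; these numbers are not from the slides.

```python
def proportional_allocation(stratum_sizes: dict[str, int], total_sample: int) -> dict[str, int]:
    """Allocate the sample to strata in proportion to stratum size."""
    population = sum(stratum_sizes.values())
    return {
        name: round(size * total_sample / population)
        for name, size in stratum_sizes.items()
    }


# Hypothetical strata, for illustration only
strata = {"urban": 6000, "rural": 3000}
allocation = proportional_allocation(strata, 300)
print(allocation)  # {'urban': 200, 'rural': 100}
```

With less convenient numbers, rounding can make the allocations sum to slightly more or less than the intended total, so the result may need a manual adjustment.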
- 117. 4. Cluster sampling • Groups of study units (clusters) instead of individual study units are selected at a time • Assumes homogeneity of the population with respect to the characteristic to be measured • All the study units in the selected clusters are included in the study • Used in geographically scattered areas where visiting dispersed study units is time consuming and costly • Example: a simple random sample of 5 villages from 30 villages • Easier but less representative
- 118. 5. Multistage sampling • Carried out in stages – PSU, SSU… • Used in very large and diverse populations • The method used in most community-based big studies • E.g. In a study to be undertaken in a big town the sampling may involve stages like selection of kefetegnas, kebeles and finally houses • Representativeness and reduced cost 124
- 119. 5. Multistage sampling • The larger the number of clusters, the greater is the likelihood that the sample will be representative. • Further, the sampling units at community level should be selected randomly (avoid convenience sampling!). 125
- 120. Bias in sampling • Bias in sampling is a systematic error in sampling procedures, which leads to a distortion in the results of the study. • Bias can be introduced as a consequence of improper sampling procedures, which result in the sample not being representative of the study population. 126
- 121. Bias … • There are several possible sources of bias that may arise when sampling. The most well known source is non-response. • Non-response can occur in any interview situation • Respondents may refuse or forget to fill in the questionnaire • The problem lies in the fact that non-respondents in a sample may exhibit characteristics that differ systematically from the characteristics of respondents. 127
- 122. Bias … There are several ways to deal with this problem and reduce the possibility of bias: 1. Data collection tools should be pre-tested. 2. If non-response is due to absence of the subjects, follow-up of non-respondents may be considered. 3. If non-response is due to refusal to co-operate, an extra, separate study of non-respondents may be considered in order to identify to what extent they differ from respondents. 4. Include additional people in the sample, so that non- respondents can be replaced if their absence was very unlikely to be related to the topic being studied. 128
- 123. Bias … Other sources of bias in sampling: • Studying volunteers only – volunteers are motivated to participate in the study. • Sampling of registered patients only – patients reporting to a clinic are likely to differ systematically from people seeking alternative treatments • Seasonal bias. • Tarmac bias – studying only areas that are easily accessible by car.
- 124. Non-probability sampling methods Quota Sampling: Each data collector is assigned a fixed quota of subjects to interview; the number falling into certain categories (like residence, sex, age, etc.) are also fixed. On the other hand, the interviewers are free to select anybody they like. From common sense point of view, quota sampling looks good. It seems to guarantee that the sample will be like the population with respect to all the important characteristics that affect the variable of interest. 130
- 125. In quota sampling, the sample is hand-picked to resemble the population with respect to some key characteristics. The method seems reasonable, but does not work very well. The reason is unintentional bias on the part of the interviewers. 131
- 126. Other non-probability sampling methods • Purposive sampling • Snowball or chain sampling • Extreme case sampling • Maximum variation sampling • Homogeneous sampling • Critical case sampling 132
- 127. Sample size estimation • How many subjects are needed in the sample to be able to draw conclusions about the whole population? – Depends on the expected variation in the data and the number of units per cell for analysis – The eventual sample size is a compromise between what is desirable and what is feasible
- 128. Sample size cont… • Minimum sample size can be calculated depending on the objective of the study – Estimation of population parameter with certain precision • Single variable estimation (single population mean, proportion or rate) • Descriptive studies - Prevalence, coverage and utilization rate studies – Test of significant difference between groups • Analytic studies - comparative cross-sectional, case- control, cohort and clinical trials 134
- 129. Sample size - single proportion • For making a confidence limit statement (such as in a prevalence study), the following formula can be used to estimate the minimum sample size: $n = \dfrac{Z_{1-\alpha/2}^{2}\, P(1-P)}{d^{2}}$ • For a population < 10,000, use the finite population correction: $n_f = \dfrac{N\, Z_{1-\alpha/2}^{2}\, P(1-P)}{d^{2}(N-1) + Z_{1-\alpha/2}^{2}\, P(1-P)}$
- 130. Single proportion cont… • Parameters in the formula – n is the minimum sample size – P is an estimate of the prevalence rate for the population • Taken from available data or a pilot study result; if neither is available, use 0.5, which gives the largest minimum sample size; if P is given as a range, take the value closest to 0.5. – d is the margin of sampling error tolerated – $Z_{1-\alpha/2}$ is the standard normal value at the 100(1−α)% confidence level; α is mostly taken to be 5% • Usually the 95% confidence level is used, for which $Z_{1-\alpha/2}$ = 1.96 – N is the population size
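
The single-proportion formula and its finite population correction can be sketched as below. The inputs P = 0.5, d = 0.05 and N = 5000 are illustrative assumptions, not values from the slides, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_single_proportion(p, d, confidence=0.95, population=None):
    """Minimum n for estimating a proportion p within margin d.

    If `population` (N) is given, the finite population correction is applied.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # Z_{1-alpha/2}
    core = z**2 * p * (1 - p)
    if population is None:
        n = core / d**2
    else:
        n = population * core / (d**2 * (population - 1) + core)
    return math.ceil(n)  # round up to a whole subject


n_infinite = sample_size_single_proportion(p=0.5, d=0.05)
n_finite = sample_size_single_proportion(p=0.5, d=0.05, population=5000)
print(n_infinite, n_finite)  # 385 357
```

Rounding up rather than to the nearest integer keeps the achieved precision at least as good as the one requested.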
- 131. Exercise • What sample size do we need to estimate the prevalence of HIV among residents of a town such that the error of estimation is within 1% of its actual parameter with 95% confidence? 137
- 132. Measuring prevalence for more than one item in one group • Take estimated prevalence of the most important item to be measured or • Determine sample size for each item/specific objective and then – Take estimated prevalence of the item that gives the maximum sample size 138
- 133. Sample size-two proportion For a test of significance study, the following formula can be used: $n = \dfrac{\left(Z_{1-\alpha/2} + Z_{1-\beta}\right)^{2}\left[p_1(1-p_1) + p_2(1-p_2)\right]}{(p_1 - p_2)^{2}}$ Parameters: n – size of sample in each group; P1, P2 – estimated population prevalence in the comparison groups; β = 1 − Power, where power is the probability that the test will produce a significant difference if the two proportions really differ – Usually a power of 80% or 90% is used
- 134. Exercise A study is designed to assess the difference in the proportion of physicians leaving health services in urban and rural areas. From available literature, 30% and 15% of physicians are estimated to leave services within three years of graduation in rural and urban areas, respectively. What sample size is required for the study?
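
One way to work through this exercise is sketched below. The exercise does not state the confidence level or power, so 95% confidence and 80% power are assumed here; the function name is illustrative.

```python
import math
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Sample size per group for comparing two independent proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # Z_{1-alpha/2}
    z_beta = NormalDist().inv_cdf(power)           # Z_{1-beta}
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2)


# Assumed: alpha = 0.05 (95% confidence) and power = 0.80
n_per_group = sample_size_two_proportions(p1=0.30, p2=0.15)
print(n_per_group)  # 118
```

Under these assumptions about 118 physicians are needed in each group; a power of 90% would give a larger number.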
- 135. Sample size – case-control studies • Formula – $n = \dfrac{\left[Z_{1-\alpha/2}\sqrt{\left(1 + \frac{1}{m}\right) p'(1-p')} + Z_{1-\beta}\sqrt{p_1(1-p_1) + \frac{p_0(1-p_0)}{m}}\right]^{2}}{(p_1 - p_0)^{2}}$ • Parameters: – P1, P0 – estimated prevalence of exposure in the cases and controls respectively – P0 can be estimated as the population prevalence of exposure – P′ – derived from P1, P0, m and the odds ratio – OR: odds ratio of exposure between cases and controls – m: number of control subjects per case subject
- 136. Exercise • Example: Suppose you want to test for a difference in exposure status between cases and controls at the 95% confidence level and with a power of 80%, using a 1:1 ratio of cases to controls and looking for an odds ratio of 2. You assume the prevalence of exposure among controls is 25%. What sample size do you need?
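
A hedged sketch of this exercise follows. It assumes the commonly used unmatched case-control formula, with the exposure prevalence in cases derived from the odds ratio as p1 = OR·p0 / (1 + p0(OR − 1)) and the pooled prevalence as p′ = (p1 + m·p0)/(1 + m). These two relations are assumptions of the sketch; the slides only say that P′ is derived from P1, P0, m and the odds ratio.

```python
import math
from statistics import NormalDist


def exposure_in_cases(p0: float, odds_ratio: float) -> float:
    """Exposure prevalence among cases implied by p0 and the odds ratio (assumed relation)."""
    return odds_ratio * p0 / (1 + p0 * (odds_ratio - 1))


def sample_size_case_control(p0, odds_ratio, m=1, alpha=0.05, power=0.80):
    """Number of cases needed, with m controls per case (assumed formula)."""
    p1 = exposure_in_cases(p0, odds_ratio)
    p_bar = (p1 + m * p0) / (1 + m)  # pooled exposure prevalence p'
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    term_null = z_alpha * math.sqrt((1 + 1 / m) * p_bar * (1 - p_bar))
    term_alt = z_beta * math.sqrt(p1 * (1 - p1) + p0 * (1 - p0) / m)
    return math.ceil((term_null + term_alt) ** 2 / (p1 - p0) ** 2)


n_cases = sample_size_case_control(p0=0.25, odds_ratio=2, m=1)
print(n_cases)  # 152
```

With a 1:1 ratio this gives 152 cases and 152 controls under the stated assumptions.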
- 137. Sample size-two proportion • More than one comparison variable – take the one with the smallest estimated difference, which gives the largest sample size • Different formulae – Case-control studies – Matched studies – Survival analysis – Other cases • Reference – http://www.statsdirect.com/help/sample_size_and_methods/sms.htm
- 138. Five key factors 1. Confidence level: how certain you want to be that the population figure is within the sample estimate and its associated precision. 2. Variability in the population: the SD is the most usual measure and often needs to be estimated. 3. Margin of error or precision: a measure of the possible difference between the sample estimate and the actual population value. 4. The population proportion: the proportion of items in the population displaying the attributes that you are seeking. 5. Population size: only important if the sample size is greater than 5% of the population in which case the sample size reduces. 144
- 139. Sample size – other considerations • Non-response – Add contingency – say 10% • More – sensitive topic, self-administered questionnaire (up to 30%) – Response rate for • Cross-sectional survey >85% • Cohort - >60-80% • Sampling technique – In complex samples (cluster, multistage) increase the sample size to account for design effect 145
- 140. Sample size – other considerations cont. – Design effect – ratio of the variance of an estimate derived from a complex sampling design to the variance of the same estimate from a simple random sample – Usually the sample size is multiplied by 2 (or 1.5) in cluster sampling • Increase – large PSU, many stages, clustered variable • Qualitative methods – estimate, not determined • Better to have good quality data than a large sample after a certain point • Better to have a representative sample than a large one – Use representative sampling techniques
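
The adjustments for design effect and non-response can be combined in a small helper. This is a sketch: the starting size of 385, the design effect of 2 and the 10% contingency are illustrative inputs, and multiplying the two adjustments together is one common convention rather than a rule stated on the slides.

```python
import math


def adjust_sample_size(n_srs: int, design_effect: float = 1.0, nonresponse_pct: int = 0) -> int:
    """Inflate a simple-random-sample size for design effect and non-response.

    nonresponse_pct is the contingency added, in percent (e.g. 10 for 10%).
    """
    return math.ceil(n_srs * design_effect * (100 + nonresponse_pct) / 100)


# Illustrative inputs: n = 385 from a single-proportion calculation
adjusted = adjust_sample_size(385, design_effect=2, nonresponse_pct=10)
print(adjusted)  # 847
```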
- 141. Sampling distribution Definition: A parameter is a numerical descriptive measure of a population (e.g. μ). A statistic is a numerical descriptive measure of a sample (e.g. $\bar{X}$). To each sample statistic there corresponds a population parameter. We use $\bar{X}$, $S^{2}$, $S$, $p$, etc. to estimate μ, σ², σ, P (or π), etc.
- 142. Sampling distribution of Means • The sampling distribution of means is one of the most fundamental concepts of statistical inference, and it has remarkable properties. • Since it is a frequency distribution, it has its own mean and standard deviation • Example: consider a population of 6 individuals with weights 55.7, 66.7, 85.5, 79.7, 122.4 and 78.1. Select all possible samples of size 3 from this population, check whether the sample mean is an unbiased estimate of the population mean, and calculate the standard error of the sample mean.
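- 143. Measurements of weight of individuals of the population • Population values: 55.7, 66.7, 85.5, 79.7, 122.4, 78.1 • Sum of observations: 488.1 • Population mean: $\mu = \dfrac{\sum X}{N} = 81.35$ • Population SD: $\sigma = \sqrt{\dfrac{\sum (X - \mu)^{2}}{N}} = 20.77$ • Number of all possible unique samples of size n = 3: $\binom{N}{n} = \binom{6}{3} = 20$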
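- 144. All possible samples of size 3 and their means

  | Sample | Obs1 | Obs2 | Obs3 | Mean |
  |---|---|---|---|---|
  | S1 | 55.7 | 66.7 | 85.5 | 69.30 |
  | S2 | 55.7 | 66.7 | 79.7 | 67.37 |
  | S3 | 55.7 | 66.7 | 122.4 | 81.60 |
  | S4 | 55.7 | 66.7 | 78.1 | 66.83 |
  | S5 | 55.7 | 85.5 | 79.7 | 73.63 |
  | S6 | 55.7 | 85.5 | 122.4 | 87.87 |
  | S7 | 55.7 | 85.5 | 78.1 | 73.10 |
  | S8 | 55.7 | 79.7 | 122.4 | 85.93 |
  | S9 | 55.7 | 79.7 | 78.1 | 71.17 |
  | S10 | 55.7 | 122.4 | 78.1 | 85.40 |
  | S11 | 66.7 | 85.5 | 79.7 | 77.30 |
  | S12 | 66.7 | 85.5 | 122.4 | 91.53 |
  | S13 | 66.7 | 85.5 | 78.1 | 76.77 |
  | S14 | 66.7 | 79.7 | 122.4 | 89.60 |
  | S15 | 66.7 | 79.7 | 78.1 | 74.83 |
  | S16 | 66.7 | 122.4 | 78.1 | 89.07 |
  | S17 | 85.5 | 79.7 | 122.4 | 95.87 |
  | S18 | 85.5 | 79.7 | 78.1 | 81.10 |
  | S19 | 85.5 | 122.4 | 78.1 | 95.33 |
  | S20 | 79.7 | 122.4 | 78.1 | 93.40 |

  Sum of means: 1627.00 • Mean of means: 81.35 • Variance of means: 86.27 • SD of sample means: 9.29 • Sample mean: $\bar{X} = \dfrac{\sum X}{n}$ • Sample variance: $S^{2} = \dfrac{\sum (X - \bar{X})^{2}}{n-1}$ • Mean of sample means: $\mu_{\bar{X}} = \mu$ • Standard deviation of $\bar{X}$ (standard error of $\bar{X}$): $\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}}\sqrt{\dfrac{N-n}{N-1}}$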
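
The enumeration of all 20 samples can be reproduced programmatically. This sketch uses `itertools.combinations` and the standard `statistics` module; variable names are illustrative.

```python
from itertools import combinations
from statistics import mean, pstdev

weights = [55.7, 66.7, 85.5, 79.7, 122.4, 78.1]
N, n = len(weights), 3

# Means of all possible samples of size 3 drawn without replacement
sample_means = [mean(sample) for sample in combinations(weights, n)]

mu = mean(weights)       # population mean
sigma = pstdev(weights)  # population SD (divides by N)

mean_of_means = mean(sample_means)
se_enumerated = pstdev(sample_means)  # SD of the 20 sample means
se_formula = sigma / n**0.5 * ((N - n) / (N - 1)) ** 0.5

print(len(sample_means), round(mean_of_means, 2), round(se_enumerated, 2), round(se_formula, 2))
```

The mean of the sample means equals the population mean (unbiasedness), and the enumerated standard error agrees with the formula that includes the finite population correction.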
- 145. Properties 1. The mean of the sampling distribution of means is the same as the population mean, μ 2. The SD of the sampling distribution of sample means is $\dfrac{\sigma}{\sqrt{n}}\sqrt{\dfrac{N-n}{N-1}}$, which is ≈ σ/√n when the population size N is large relative to n 3. The sampling distribution of sample means is approximately normal, regardless of the shape of the population distribution, provided n is large enough (> 30) (Central limit theorem).
- 146. Module 3.2: Estimation and Hypothesis Testing 152
- 148. Estimation Definition Calculating some statistics from sample data that is offered as an approximation of the corresponding parameter of the population from which the sample was drawn. 154
- 149. Cont… Estimator: a method or rule for computing an estimate. An estimator should be unbiased. • An estimator T of a parameter θ is said to be an unbiased estimator of θ if E(T) = θ.
- 150. Cont… • Estimation is calculating, from sample data, some statistic that offers an approximation for the corresponding parameter of the population from which the sample is drawn. • Properties of good estimators – Unbiased: An estimator is said to be unbiased if in the long run it takes on the value of the population parameter – Efficiency: An estimator is said to be efficient if in the class of unbiased estimators it has minimum variance – Consistency: A sequence of estimators is said to be consistent if it converges in probability to the true value of the parameter – Sufficiency: an estimator is sufficient if it uses all the sample information 156
- 151. Estimation methods • Point estimate: a single numeric value used to estimate the corresponding population parameter. • Frequently used point estimators (sample statistic → corresponding population parameter): – sample mean $\bar{x}$ → population mean μ – sample variance $s^{2}$ → population variance σ² – sample standard deviation $s$ → population standard deviation σ – sample proportion $p$ → population proportion P
- 152. Interval Estimate • Interval estimate: Two numerical values defining a range of values that, with a specified degree of confidence, we feel include the parameter being estimated. 158
- 153. Cont… • Even though the sample mean is a good estimator, it is better to also give an interval indicating the probable magnitude of the population mean. • Confidence intervals are about putting some bounds on how far away the truth might be from your estimate. • The sample mean is the best unbiased estimator of the population mean.
- 154. Cont… • If the sample is drawn from a normally distributed population, the sampling distribution of the sample mean will be normal. • Even if the distribution of the population is non-normal, the sampling distribution will be approximately normal if the sample size is sufficiently large. • About ninety-five percent (95%) of the possible values of $\bar{x}$ will lie within two standard errors of the population mean, i.e. within $\mu \pm 2\sigma_{\bar{x}}$
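- 155. Interval estimator component • Estimator ± (reliability coefficient) × (standard error), where the reliability coefficient is a value of Z or t: $\bar{x} \pm z\,\dfrac{\sigma}{\sqrt{n}}$ or $\bar{x} \pm t\,\dfrac{s}{\sqrt{n}}$ • Standard error – a measure of the variability of the sample mean in repeated sampling.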
- 156. Standard Error of the Mean • It helps us to quantify how good our estimate of the mean is as an estimate of the true, and unknown, population mean: how large an error might we be making? • The standard error of the sample mean is $SD/\sqrt{n}$, and it is: • the error that arises from variability in the sample means • an indication of the variability of the distribution of sample means caused by sampling error and measurement error.
- 157. Confidence interval • The confidence interval provides a range that is highly likely (often 95% or 99%) to contain the true population value, or parameter that is being estimated. • The narrower the interval the more informative is the result. It is usually calculated using the point estimate and its standard error. 163
- 158. • Provide an interval around our estimate showing how much error there might be either side of the estimate lower upper confidence estimate confidence interval interval 164
- 159. Interval estimate for mean: one sample situation • Confidence interval of the mean with known population standard deviation: $\bar{x} \pm z_{1-\alpha/2}\, SE(\bar{x}) = \bar{x} \pm z_{1-\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}$ • Confidence interval of the mean with unknown population standard deviation for small sample size: $\bar{x} \pm t_{1-\alpha/2}(df)\, se(\bar{x}) = \bar{x} \pm t_{1-\alpha/2}(n-1)\,\dfrac{s}{\sqrt{n}}$
- 160. Cont… Interpretation of confidence interval • Probabilistic: in repeated sampling from a normally distributed population with known SD, 100(1−α)% of all intervals constructed in this way will, in the long run, include the population mean • Practical: when sampling from a normally distributed population with known SD (σ), we are 100(1−α)% confident that the single computed interval contains the population mean.
- 161. Cont… • Commonly used values of the confidence coefficient are 0.90, 0.95 and 0.99, with associated reliability coefficients of 1.645, 1.96 and 2.58 respectively for the standard normal random variable (Z). • Precision: the quantity obtained by multiplying the reliability coefficient by the SE of the mean is called the margin of error.
- 162. Computing a 95 and 99% CI for μ • Given $\bar{x}$ = 19.26, σ = 2.52 and n = 117 • At 95% confidence level, α = 0.05 (α/2 = 0.025) and at 99%, α = 0.01 (α/2 = 0.005) • $Z_{0.975}$ = 1.96 and $Z_{0.995}$ = 2.58 • 95% CI for μ: $19.26 \pm 1.96 \times 2.52/\sqrt{117}$ = (18.80 ≤ μ ≤ 19.72) • 99% CI for μ: $19.26 \pm 2.58 \times 2.52/\sqrt{117}$ = (18.66 ≤ μ ≤ 19.86)
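
The two intervals can be reproduced with a short function. This is a sketch using `statistics.NormalDist` for the reliability coefficient instead of a table; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(xbar, sigma, n, confidence):
    """CI for the mean when the population SD (sigma) is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # reliability coefficient
    margin = z * sigma / sqrt(n)                        # margin of error
    return xbar - margin, xbar + margin


ci95 = z_confidence_interval(19.26, 2.52, 117, 0.95)
ci99 = z_confidence_interval(19.26, 2.52, 117, 0.99)
print([round(v, 2) for v in ci95], [round(v, 2) for v in ci99])
```

The exact normal quantiles differ slightly from the rounded table values 1.96 and 2.58, but the limits agree with the slide to two decimal places.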
- 163. Computing CI for μ when σ is unknown • When the population SD (σ) is unknown, it should be estimated by the sample SD (s) • Accordingly, the standard error of the sample mean will be estimated by s/√n • Therefore, a 95% CI for μ with n < 30, say, will be based on the t-statistic as $\bar{x} \pm t_{0.975}(n-1)\,\dfrac{s}{\sqrt{n}}$, where (n−1) is the degrees of freedom
- 164. Example • Consider the following summary information based on data on systolic blood pressure of a random sample of 30 individuals selected from a normal population: $\bar{x}$ = 115.9, s = 16.3. Compute a 95% and 99% CI for μ • n = 30, df = 30 − 1 = 29; at 95% confidence level, $t_{0.975}(29)$ = 2.045 and at 99%, $t_{0.995}(29)$ = 2.756; $se(\bar{x}) = 16.3/\sqrt{30} = 2.98$ • 95% CI for μ: 115.9 ± 2.045 × 2.98 = (109.8 ≤ μ ≤ 122.0) • 99% CI for μ: 115.9 ± 2.756 × 2.98 = (107.7 ≤ μ ≤ 124.1)
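
The same calculation in code, as a sketch. Python's standard library has no t-distribution quantile function, so the critical values 2.045 and 2.756 are taken from the example itself and passed in as constants.

```python
from math import sqrt


def t_confidence_interval(xbar, s, n, t_critical):
    """CI for the mean when sigma is unknown; t_critical comes from a t table."""
    margin = t_critical * s / sqrt(n)
    return xbar - margin, xbar + margin


# Critical values for 29 df as given in the example
ci95 = t_confidence_interval(115.9, 16.3, 30, t_critical=2.045)
ci99 = t_confidence_interval(115.9, 16.3, 30, t_critical=2.756)
print([round(v, 1) for v in ci95], [round(v, 1) for v in ci99])
```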
- 165. Standard Error of the difference between two sample means • Most medical research is comparative; as a result, we are more often concerned with two or more samples than with a single sample, i.e., with comparing the difference between two samples. • A confidence interval for the difference helps in deciding whether or not it is likely that the two means are equal • When the interval includes 0, the two means might be equal. • When the interval does not include zero, the two means differ significantly at the corresponding level.
- 166. Cont…. The Z statistic can be used in a confidence interval to estimate the difference between two means if the variances of the populations are known. A 95% confidence interval for the difference of the two means is given by: $(\bar{X}_1 - \bar{X}_2) \pm Z_{0.975}\sqrt{\dfrac{\sigma_1^{2}}{n_1} + \dfrac{\sigma_2^{2}}{n_2}} = (\bar{X}_1 - \bar{X}_2) \pm 1.96\sqrt{\dfrac{\sigma_1^{2}}{n_1} + \dfrac{\sigma_2^{2}}{n_2}}$
- 167. Unknown Variance The t statistic is used when the population standard deviations are unknown and the sample sizes are small, under two sets of conditions: 1. When equal variances are assumed 2. When the variances are unequal
- 168. Cont… • When the variances are equal, the sample variances are pooled to estimate the common variance. • The pooled estimate is obtained as a weighted average of the two sample variances. • Each sample variance is weighted by its degrees of freedom (n−1). • If the sample sizes are equal, the weighted average equals the arithmetic mean of the two sample variances. • If the sample sizes are different, the weighted average takes advantage of the additional information provided by the larger sample.
- 169. Unknown but equal variances • The pooled standard deviation (Sp) is calculated using the following formula: $S_p = \sqrt{\dfrac{(n_1 - 1)S_1^{2} + (n_2 - 1)S_2^{2}}{n_1 + n_2 - 2}}$ • Then the standard error of the difference of the two sample means is: $se(\bar{X}_1 - \bar{X}_2) = S_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}$
- 170. Example: Was there a difference in the mean fasting blood glucose level between men and women, given data from normal populations?

  | Sex | Mean | SD | n |
  |---|---|---|---|
  | Men | 98.14 | 19.59 | 57 |
  | Women | 95.19 | 14.03 | 59 |
  | Total | 96.64 | 16.98 | 116 |

  • Compute a 95% CI for the population mean difference – Assuming the standard deviations (SD) are population SDs – Assuming the population variances are unknown but equal
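
Both requested intervals can be sketched as follows. The t critical value for 114 degrees of freedom is not given in the slides; 1.981 is an assumed table value used only for illustration.

```python
from math import sqrt

mean_m, sd_m, n_m = 98.14, 19.59, 57  # men
mean_w, sd_w, n_w = 95.19, 14.03, 59  # women
diff = mean_m - mean_w

# (a) SDs treated as known population SDs: Z-based interval
se_known = sqrt(sd_m**2 / n_m + sd_w**2 / n_w)
ci_z = (diff - 1.96 * se_known, diff + 1.96 * se_known)

# (b) Unknown but equal variances: pooled SD and t-based interval
sp = sqrt(((n_m - 1) * sd_m**2 + (n_w - 1) * sd_w**2) / (n_m + n_w - 2))
se_pooled = sp * sqrt(1 / n_m + 1 / n_w)
T_CRIT_114 = 1.981  # assumed table value of t_{0.975} with 114 df
ci_t = (diff - T_CRIT_114 * se_pooled, diff + T_CRIT_114 * se_pooled)

print([round(v, 2) for v in ci_z], [round(v, 2) for v in ci_t])
```

Both intervals include 0, so on these data the two population means might be equal.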
- 171. Factors affecting the length of a confidence interval (CI) – Sample size (n) – Standard deviation (σ) – Confidence level (1-α) 177
- 172. Hypothesis Testing Why is hypothesis testing so important? • Hypothesis testing provides an objective framework for making decisions using probabilistic method, rather than relying on subjective impressions. • The Null hypothesis, denoted by Ho, is the hypothesis that is to be tested. • The alternative hypothesis H1 is the hypothesis that in some sense contradicts the null hypothesis. 178
- 173. Cont… • While making decision on the null and alternative hypothesis, we have four possible outcomes: 1. We accept Ho, and Ho is in fact true – confidence level (1-α). 2. We accept Ho, and H1is in fact true – Type II error (β). 3. We reject Ho, and Ho is in fact true – Type I error (α). 4. We reject Ho, and H1 in fact is true – Power of the test (1- β). 179
- 174. One Sample Test for the Mean from a Normal population 1. One Sided Alternative (One-tailed) Unknown Variance • A one tailed test is a test in which the values of the parameter being studied (in this case mean) under the alternative hypothesis are allowed to be either greater than or less than the values of the parameter under the null hypothesis, but not both 180
- 175. Cont… I. Alternative mean < Null mean • One sample t-test for the mean of a normal distribution with unknown variance, to test the hypothesis Ho: μ = μo versus H1: μ < μo at significance level α, using $t = \dfrac{\bar{X} - \mu_o}{s/\sqrt{n}}$ • If $t < t_{n-1,\,\alpha}$, then reject Ho • If $t \ge t_{n-1,\,\alpha}$, then do not reject Ho
- 176. Cont… Two ways to determine statistical significance: 1. Critical value method – comparing the tabulated value of the test statistic to the calculated value for a given level of significance 2. P-value method 182
- 177. Cont… The p-value is the α level at which the given value of the test statistic (such as t) would be on the borderline between the acceptance and rejection regions. For the lower-tailed test, $p = P(t_{n-1} \le t)$, where p is the area to the left of t under a $t_{n-1}$ distribution.
- 178. Guidelines to judge p-value 1. If 0.01 <= p < 0.05, statistically significant 2. If 0.001 <= p < 0.01, statistically highly significant 3. If p < 0.001, very highly statistically significant 4. If p >= 0.05, not statistically significant
- 179. II. Alternative mean > Null mean • To test the hypothesis Ho: μ = μo versus H1: μ > μo, variance unknown, with a significance level α, the test is based on t, where $t = \dfrac{\bar{x} - \mu_o}{s/\sqrt{n}}$ • If $t > t_{n-1,\,1-\alpha}$, Ho is rejected • If $t \le t_{n-1,\,1-\alpha}$, Ho is accepted
- 180. Cont… 2. Two-sided alternatives (two tailed) It is a test in which the values of the parameter being studied under the alternate hypothesis are allowed to be either greater than or less than the values of the parameter under the null hypothesis, Ho. 186
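- 181. Cont… • To test the hypothesis Ho: μ = μo versus H1: μ ≠ μo with a significance level of α, compute $t = \dfrac{\bar{x} - \mu_o}{s/\sqrt{n}}$ • If $|t| > t_{n-1,\,1-\alpha/2}$, Ho is rejected • If $|t| \le t_{n-1,\,1-\alpha/2}$, Ho is accepted
- 182. Cont… • P-value for two tailed t-test, with $t = \dfrac{\bar{x} - \mu_o}{s/\sqrt{n}}$: $p = 2\,P(t_{n-1} \le t)$ if $t \le 0$, and $p = 2\,[1 - P(t_{n-1} \le t)]$ if $t > 0$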
- 183. Cont… One sample Z-test - Two Tailed • The critical values and p-values for the one sample t-test have been specified in terms of percentiles of the t distribution, assuming that the underlying variance is unknown. • In some applications, the variance may be assumed known from prior studies. In this case, the test statistic t-test is replaced by the test statistic ′Z′ 189
- 184. Cont… To test the hypothesis, we use $z = \dfrac{\bar{x} - \mu_o}{\sigma/\sqrt{n}}$ • If $Z < Z_{\alpha/2}$ or $Z > Z_{1-\alpha/2}$, reject Ho • If $Z_{\alpha/2} \le Z \le Z_{1-\alpha/2}$, don't reject Ho
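
A minimal sketch of the two-tailed one-sample Z-test follows. The sample values (mean 19.26, σ = 2.52, n = 117) are reused from the earlier confidence interval example, while the null value μo = 20 is a hypothetical choice made only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(xbar, mu0, sigma, n, alpha=0.05):
    """Two-tailed Z-test for a mean with known population SD."""
    std_normal = NormalDist()
    z = (xbar - mu0) / (sigma / sqrt(n))
    p_value = 2 * (1 - std_normal.cdf(abs(z)))  # two-tailed p-value
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # Z_{1-alpha/2}
    return z, p_value, abs(z) > z_crit


# mu0 = 20.0 is hypothetical; the other values come from the earlier CI example
z, p_value, reject_h0 = one_sample_z_test(xbar=19.26, mu0=20.0, sigma=2.52, n=117)
print(round(z, 2), round(p_value, 4), reject_h0)
```

Rejecting Ho here agrees with the fact that 20 lies outside the 95% confidence interval (18.80, 19.72) computed from the same data.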
- 185. Cont… • One Tail • Alternative mean < Null mean (Variance Known): if $Z < Z_{\alpha}$, Ho is rejected; if $Z \ge Z_{\alpha}$, Ho is accepted • Alternative mean > Null mean (Variance Known): if $Z > Z_{1-\alpha}$, Ho is rejected; if $Z \le Z_{1-\alpha}$, Ho is accepted
- 186. Relationship between Hypothesis Testing and confidence interval – Two sided case • Suppose we are testing Ho: μ = μo versus H1: μ ≠ μo. Ho is rejected with a two-sided level α test if and only if the two-sided 100(1−α)% confidence interval for μ does not contain μo; otherwise accept Ho.
- 187. Hypothesis Testing: Two Sample Inference • In two-sample hypothesis testing, the underlying parameters of two different populations, neither of whose values is assumed known, are compared. • Two samples are said to be paired when each data point of the first sample is matched and related to a unique data point of the second sample.
- 188. Cont… • Two samples are said to be independent if the data points in one sample are unrelated to the data points in the second sample
- 189. The paired t-test • The test statistic is $t = \dfrac{\bar{d}}{SD(d)/\sqrt{n}}$, where $\bar{d}$ is the mean of the observed differences, $SD(d)$ is the sample standard deviation of the observed differences, and $n$ is the number of differences
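- 190. Cont… • Degrees of freedom: $n-1$ – If $t > t_{n-1,\,1-\alpha/2}$ or $t < -t_{n-1,\,1-\alpha/2}$, then $H_0$ is rejected. – If $-t_{n-1,\,1-\alpha/2} \le t \le t_{n-1,\,1-\alpha/2}$, then $H_0$ is not rejected. • The two-sided p-value is twice the area under the $t_{n-1}$ distribution beyond $|t|$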
- 191. • Example: • Suppose a sample of 20 students was given a test before studying a particular module and then again after completing the module. • We want to find out if, in general, our teaching leads to improvements in students’ knowledge/skills (i.e. test scores).
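- 192. Pre- and post-module test scores for the 20 students (difference = post-module score minus pre-module score):

| Student | Pre-module score | Post-module score | Difference |
|---|---|---|---|
| 1 | 18 | 22 | 4 |
| 2 | 21 | 25 | 4 |
| 3 | 16 | 17 | 1 |
| 4 | 22 | 24 | 2 |
| 5 | 19 | 16 | -3 |
| 6 | 24 | 29 | 5 |
| 7 | 17 | 20 | 3 |
| 8 | 21 | 23 | 2 |
| 9 | 23 | 19 | -4 |
| 10 | 18 | 20 | 2 |
| 11 | 14 | 15 | 1 |
| 12 | 16 | 15 | -1 |
| 13 | 16 | 18 | 2 |
| 14 | 19 | 26 | 7 |
| 15 | 18 | 18 | 0 |
| 16 | 20 | 24 | 4 |
| 17 | 12 | 18 | 6 |
| 18 | 22 | 25 | 3 |
| 19 | 15 | 19 | 4 |
| 20 | 17 | 16 | -1 |
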
- 193. • Hypothesis: $H_0: \Delta = 0$ and $H_A: \Delta \ne 0$ • Calculating the mean and standard deviation of the differences: $\bar{d} = 2.05$ and $sd(d) = 2.837$. Therefore, $se(\bar{d}) = 2.837/\sqrt{20} = 0.634$ • So, we have: $t = 2.05/0.634 = 3.231$ on 19 df with p = 0.004. • Therefore, there is strong evidence that, on average, the module does lead to improvements.
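
The arithmetic of this example can be checked with a short Python sketch. The scores are those of the 20 students in the table; the variable names are illustrative, and the critical value 2.093 is the tabulated $t_{19,\,0.975}$, used here as an assumption in place of an exact p-value calculation.

```python
import math
from statistics import mean, stdev

# Pre- and post-module scores for students 1 to 20
pre = [18, 21, 16, 22, 19, 24, 17, 21, 23, 18,
       14, 16, 16, 19, 18, 20, 12, 22, 15, 17]
post = [22, 25, 17, 24, 16, 29, 20, 23, 19, 20,
        15, 15, 18, 26, 18, 24, 18, 25, 19, 16]

diffs = [after - before for before, after in zip(pre, post)]
n = len(diffs)

d_bar = mean(diffs)           # mean difference
sd_d = stdev(diffs)           # sample SD of the differences
se_d = sd_d / math.sqrt(n)    # standard error of the mean difference
t = d_bar / se_d              # paired t statistic on n - 1 df

T_CRIT_19 = 2.093             # tabulated t(19, 0.975) for a two-sided 5% test
reject_h0 = abs(t) > T_CRIT_19

print(f"d_bar = {d_bar:.2f}, sd = {sd_d:.3f}, se = {se_d:.3f}, t = {t:.3f}")
print("Reject H0 at the 5% level:", reject_h0)
```
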
- 194. Two-sample t-test for independent samples with equal variance • The test statistic is $t = \dfrac{\bar{X}_1 - \bar{X}_2}{S_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$, where $S_p^2$, the weighted average of the two sample variances, is used as the estimate of the common variance $\sigma^2$: $S_p^2 = \dfrac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$ • The degrees of freedom are the sum of the degrees of freedom of the two samples, i.e., $(n_1-1) + (n_2-1) = n_1 + n_2 - 2$
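
A minimal Python sketch of the pooled-variance calculation follows. The function name `pooled_t` and the two small samples are hypothetical and chosen only so that the arithmetic is easy to follow by hand.

```python
import math
from statistics import mean, variance


def pooled_t(sample1, sample2):
    """Return (t, df, pooled variance) for the equal-variance two-sample t-test."""
    n1, n2 = len(sample1), len(sample2)
    df = (n1 - 1) + (n2 - 1)
    # Pooled variance: sample variances weighted by their degrees of freedom
    sp2 = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / df
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    t = (mean(sample1) - mean(sample2)) / se
    return t, df, sp2


# Hypothetical data: means 12 and 8.5, sample variances 4 and 5/3
t, df, sp2 = pooled_t([10, 12, 14], [7, 8, 9, 10])
print(f"t = {t:.3f} on {df} df, pooled variance = {sp2:.2f}")
```

Here the pooled variance is (2 × 4 + 3 × 5/3) / 5 = 2.6, and the resulting t would be compared with the percentiles of the t distribution on 5 degrees of freedom.
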
- 195. Estimation and Hypothesis testing of population proportion
- 196. Sampling distribution of proportions • Construction: it is done in the same manner as for the mean • Take all possible samples of a given size • Compute the sample proportion for each • Prepare a frequency distribution of the proportions
- 197. Cont… Characteristics: – When the sample size is large, the distribution is approximately normal – The mean of the distribution, $\mu_{\hat{p}}$, will be equal to the true proportion $P$ – The variance of the distribution, $\sigma^2_{\hat{p}}$, will be equal to $\dfrac{P(1-P)}{n}$
- 198. Sampling distribution of the difference between two proportions • For independent random samples of sizes $n_1$ and $n_2$ drawn from two populations of dichotomous variables, where $P_1$ and $P_2$ are the population proportions with the characteristic • The distribution of $\hat{p}_1 - \hat{p}_2$ is approximately normal with mean $\mu_{\hat{p}_1 - \hat{p}_2} = P_1 - P_2$ • and variance $\sigma^2_{\hat{p}_1 - \hat{p}_2} = \dfrac{P_1(1-P_1)}{n_1} + \dfrac{P_2(1-P_2)}{n_2}$
- 199. Estimation of single proportions • Confidence intervals for proportions are obtained by approximation to the normal distribution, with the standard error estimated from the sample proportion. • The confidence interval for the population proportion is $p \pm z\sqrt{\dfrac{p(1-p)}{n}}$, where $p$ is the sample proportion of successes (events), $q = 1 - p$ is the proportion of failures, $n$ is the sample size and $z$ denotes the z value relating to a defined probability level.
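
A worked example may help fix the formula. The numbers below are hypothetical (a sample proportion of 0.20 from n = 100) and use z = 1.96 for a 95% interval:

```latex
\begin{aligned}
SE(p) &= \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.20 \times 0.80}{100}} = \sqrt{0.0016} = 0.04 \\
p \pm z \cdot SE(p) &= 0.20 \pm 1.96 \times 0.04 = 0.20 \pm 0.0784 \\
95\%\ \text{CI} &= (0.1216,\; 0.2784)
\end{aligned}
```
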
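- 200. Estimation of the difference between two proportions • The unbiased point estimator of $P_1 - P_2$ is $\hat{p}_1 - \hat{p}_2$ • The normal approximation applies when $n_1$ and $n_2$ are large enough and $P_1$ and $P_2$ are not close to 1 or 0 • Since the population proportions are not known, $\hat{p}_1$ and $\hat{p}_2$ are used to estimate the standard error: $\hat{\sigma}_{\hat{p}_1 - \hat{p}_2} = \sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$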
- 201. Cont… • Therefore, the $100(1-\alpha)\%$ confidence interval will be: $(\hat{p}_1 - \hat{p}_2) \pm z_{1-\alpha/2}\sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$
- 202. Hypothesis testing on single population proportions • Follows from the properties of the sampling distribution of the sample proportion • The null hypothesis: $H_0: P = P_0$ • The alternative hypothesis: $H_A: P \ne P_0$
- 203. Cont… • Test statistic: $Z = \dfrac{\hat{p} - P_0}{\sqrt{\dfrac{P_0(1-P_0)}{n}}}$ • When $H_0$ is true, $Z$ approximately follows the standard normal distribution
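
The test statistic above can be computed as in the following Python sketch. The function name `one_proportion_z` and the counts in the usage example (30 events among 100 subjects, tested against a null proportion of 0.20) are hypothetical.

```python
import math
from statistics import NormalDist


def one_proportion_z(x, n, p0):
    """Return (sample proportion, z, two-sided p-value) for H0: P = p0."""
    p_hat = x / n
    se0 = math.sqrt(p0 * (1 - p0) / n)  # standard error computed under H0
    z = (p_hat - p0) / se0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_hat, z, p_value


# Hypothetical example: 30 events in 100 subjects, null proportion 0.20
p_hat, z, p_value = one_proportion_z(x=30, n=100, p0=0.20)
print(f"p_hat = {p_hat:.2f}, z = {z:.2f}, two-sided p = {p_value:.4f}")
```

Note that the standard error uses the null value $P_0$ rather than the sample proportion, because the test statistic is evaluated under the assumption that $H_0$ is true.
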
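- 204. Testing differences between two sample proportions • The most commonly used test is $H_0: P_1 - P_2 = 0$, i.e. $P_1 = P_2$ • Under $H_0$ the two populations share a common proportion, so the pooled estimate of the proportion will be $\bar{p} = \dfrac{x_1 + x_2}{n_1 + n_2} = \dfrac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}$ • Standard error: $\hat{\sigma}_{\hat{p}_1 - \hat{p}_2} = \sqrt{\dfrac{\bar{p}(1-\bar{p})}{n_1} + \dfrac{\bar{p}(1-\bar{p})}{n_2}}$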
- 205. Cont… • The test statistic will be: $z = \dfrac{(\hat{p}_1 - \hat{p}_2) - (P_1 - P_2)}{\hat{\sigma}_{\hat{p}_1 - \hat{p}_2}}$, where $P_1 - P_2 = 0$ under $H_0$
- 206. Example: Comparison of the number of swimming hours per week among swimmers with or without erosion of dental enamel (EDE)

| Number of swimming hours per week | EDE: Yes | EDE: No | Total |
|---|---|---|---|
| ≥ 6 hours | 32 | 118 | 150 |
| < 6 hours | 17 | 127 | 144 |
| Total | 49 | 245 | 294 |

Prevalence of EDE (P): 0.167 • Standard error: 0.022 • 95% CI for P: lower 0.124, upper 0.209
- 207. 1. Estimate the prevalence of erosion of dental enamel and calculate a 95% CI 2. From previous studies among swimmers it is claimed that the prevalence of erosion of dental enamel was 14%. Is the claim justified? Give your p-value
- 208. 3. Compute the respective prevalence of erosion of dental enamel for those who had ≥ 6 hours and < 6 hours of swimming time and calculate a 95% CI for the difference in the prevalence. 4. Is there a difference in the prevalence of erosion of dental enamel between the two swimming times? Give your p-value
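- 209. Prevalence of EDE by amount of swimming time per week:

| Amount of swimming time per week | P |
|---|---|
| ≥ 6 hours | 0.213 |
| < 6 hours | 0.118 |
| Total | 0.167 |
| p1 – p2 | 0.095 |

Test of Ho: P1 = P2 vs HA: P1 ≠ P2: se(p1 – p2) = 0.044, Z = 2.174 • 95% CI for P1 – P2: se(p1 – p2) = 0.042, lower 95% limit 0.013, upper 95% limit 0.177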
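
The calculations for this example can be reproduced with the Python sketch below, which takes the counts from the 2×2 table. It assumes the pooled standard error for the test and the unpooled standard error for the confidence interval, following the formulas given earlier in this section; variable names are illustrative. Because nothing is rounded at intermediate steps, the output may differ from the figures shown above in the second or third decimal place.

```python
import math
from statistics import NormalDist

# Counts from the table: EDE cases / swimmers in each group
x1, n1 = 32, 150   # >= 6 hours per week
x2, n2 = 17, 144   # < 6 hours per week

std = NormalDist()
z975 = std.inv_cdf(0.975)  # about 1.96

# Question 1: overall prevalence and its 95% CI
p = (x1 + x2) / (n1 + n2)
se_p = math.sqrt(p * (1 - p) / (n1 + n2))
ci_p = (p - z975 * se_p, p + z975 * se_p)

# Question 3: group prevalences and 95% CI for the difference (unpooled SE)
p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2
se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
ci_diff = (diff - z975 * se_unpooled, diff + z975 * se_unpooled)

# Question 4: test of H0: P1 = P2 (pooled SE, since P1 = P2 under H0)
se_pooled = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
z = diff / se_pooled
p_value = 2 * (1 - std.cdf(abs(z)))

print(f"P = {p:.3f}, SE = {se_p:.3f}, 95% CI = ({ci_p[0]:.3f}, {ci_p[1]:.3f})")
print(f"p1 = {p1:.3f}, p2 = {p2:.3f}, difference = {diff:.3f}")
print(f"95% CI for difference = ({ci_diff[0]:.3f}, {ci_diff[1]:.3f})")
print(f"z = {z:.3f}, two-sided p = {p_value:.3f}")
```
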
- 210. Exercise: A study was conducted to look at the effect of oral contraceptives (OC) on heart disease in women 40-44 years of age over 3 years. Given the following data, is there a difference in the rate of MI between OC users and non-users? Compute the 95% CI for the difference.

| OC-use group | MI over 3 years: Yes | MI over 3 years: No | Total |
|---|---|---|---|
| OC users | 13 | 4,987 | 5,000 |
| Non-OC users | 7 | 9,993 | 10,000 |
| Total | 20 | 14,980 | 15,000 |

- 212. Errors in • Design • Execution • Analysis • Presentation • Interpretation • Omission
- 213. Statistical errors related to study design • Study aims and primary outcome measures not clearly stated • Inadequate sample size • Choice of an inappropriate high-risk sample to make inferences about the general population • Failure to report the number of participants or observations • Use of an inappropriate control group
- 214. Errors in execution • Failure to adhere to the study protocol – Misuse of sample selection procedures – Exclusion and inclusion criteria not strictly followed – Failure to follow randomization procedures
- 215. Statistical errors in presentation • Inadequate graphical or numerical description of basic data – Presenting or plotting the mean with no indication of variability – Giving SE instead of SD to describe data – Failure to define the ± notation used for describing variability – Data and results reported to an unrealistic level of numerical precision – Inappropriate graph selection that does not reflect the characteristics of the variables, and use of three-dimensional graphs for two-dimensional data
- 217. Statistical errors in analysis • Using methods of analysis when assumptions are not met • Analyzing paired data ignoring the pairing • Failing to take account of ordered categories • Treating multiple observations on one subject as independent • Improper multiple pair-wise comparisons of more than two groups • Quoting confidence intervals that include impossible values • Failure to use multivariate techniques to adjust for confounding factors
- 218. Statistical errors in interpretation of study findings • Wrong interpretation of results – “Non-significant” interpreted as “no effect” or “no difference” – Drawing conclusions not supported by the study data – Significance claimed without data analysis or statistical test mentioned • Failure to discuss sources of potential bias and confounding factors
- 219. Consequences of statistical errors • Impossible to get ethical approval to conduct the study • Other researchers may be led to follow a false line of investigation • Patients may receive an inferior treatment, either as a direct consequence of the results of the study or through a delay in the introduction of a truly effective treatment • If the results go unchallenged, the researchers may use the same inferior statistical methods in future research, and others may copy them because of the inappropriate conclusions