Bi ostat for pharmacy.ppt2
    Presentation Transcript

    • BIOSTATISTICS School of Pharmacy (COMH 607) 1
    • 1. RESEARCH METHODS 2
    • 1.1. Introduction to Research: What is Research? • A scientific study to seek hidden knowledge • A scientific study to answer a question • A scientific study of causes and effects • A scientific attempt towards new discoveries • A systematic method of inquiry • A logical attempt to find answers to problems • A systematic approach to a (medical) problem 3
    • Statistical Concept of Research• Research is a systematic collection, analysis and interpretation of data in order to solve a research question• It is classified as: – Basic research: necessary to generate new knowledge and technologies. – Applied research: necessary to identify priority problems and to design and evaluate policies and programs for optimal health care and delivery. 4
    • 1.2. Types of Epidemiological Design: A. Descriptive studies • Mainly concerned with the distribution of diseases with respect to time, place and person. • Useful for health managers to allocate resources and to plan effective prevention programmes. • Useful for generating epidemiological hypotheses, an important first step in the search for disease determinants or risk factors. • Can use routinely collected information that is readily available in many places, so descriptive studies are generally less expensive and less time-consuming than analytic studies. 5
    • It is the most common type of epidemiological design strategy in medical literature. • There are three main types: – Correlational – Case report or case series – Cross-sectional 6
    • A.1. Correlational or Ecological• Uses data from entire population to compare disease frequencies – between different groups during the same period of time, or in the same population at different points in time.• Does not provide individual data, rather presents average exposure level in the community.• Cause could not be ascertained.• Correlation coefficient is the measure of association in correlational studies. It is important to note that positive association does not necessarily imply a valid statistical association. 7
    • E.g. • Hypertension rates and average per capita salt consumption compared between two communities. • Average per capita fat consumption and breast cancer rates compared between two communities. • Comparing incidence of dental caries in relation to fluoride content of the water among towns in the Rift Valley. • Mortality from CHD in relation to per capita cigarette sales among the regions of Ethiopia. 8
    • Strength: Can be done quickly and inexpensively, often using available data. • Limitations: – Inability to link exposure with disease in individuals. – Lack of ability to control for effects of potential confounding factors. There may be other factors that are the true cause. – It may mask a non-linear relationship between exposure and disease. For example, alcohol consumption and mortality from CHD have a non-linear relationship (the curve is “J” shaped). 9
    • A.2. Case Report and Case Series • Describes the experience of a single patient or a group of patients with a similar diagnosis. Has limited value, but is occasionally revolutionary. • E.g. 5 young homosexual men with PCP seen between Oct. 1980 and May 1981 in Los Angeles aroused concern among physicians. Later, with further follow-up and thorough investigation of the strange occurrence of the disease, the diagnosis of AIDS was established for the first time. 10
    • Strength: – Very useful for hypothesis generation. • Limitations: – The report is based on a single patient or a few patients, so the finding could have arisen just by coincidence. – Lack of an appropriate comparison group. 11
    • A.3. Cross-Sectional Studies (Surveys) • Information about the status of an individual with respect to the presence or absence of exposure and disease is assessed at the same point in time. Easy to do; many surveys are like this. • For factors that remain unaltered over time, such as sex, race or blood group, the cross-sectional survey can provide evidence of a valid statistical association. • Useful for raising the question of the presence of an association rather than for testing a hypothesis. 12
    • B. ANALYTIC STUDIES• Focuses on the determinants of a disease by testing the hypothesis formulated from descriptive studies, with the ultimate goal of judging whether a particular exposure causes or prevents disease.• Broadly classified into two – observational and interventional studies. – Both types use “controls”. The use of controls is the main distinguishing feature of analytic studies. 13
    • B.1. Observational studies • Information is obtained by observation of events. No intervention is done. Cohort and case-control studies are in this category. • i. Cohort • Subjects are selected by exposure, or determinants of interest, and followed to see if they develop the disease or outcome of interest. • E.g. Follow 100 children who received BCG vaccination and another 100 who did not, and see how many of them get tuberculosis. 14
    • ii. Case Control • Subjects are selected with respect to presence or absence of disease, or outcome of interest, and then inquiries are made about past exposure to the factor(s) of interest. • E.g. Take people with and without TB, and ask them if they ever had BCG vaccination. 15
    • B.2. Interventional / Experimental • The researcher does something about the disease or exposure and observes the changes. • The investigator has control over who gets the exposure and who does not. The key is that the investigator assigns subjects to either group, whether or not this is done randomly. • Always prospective. • E.g. Assign children randomly to get chloroquine or not, and see how many develop symptomatic malaria. 16
    • Description of common terms • Statistics: The process of scientifically collecting, organizing, summarizing and interpreting data, and of drawing inferences about a body of data when only part of the data is observed. • Biostatistics: A branch of statistics in which the data being analyzed are derived from the biological and medical sciences. • Descriptive statistics: A statistical method that is concerned with the collection, organization, summarization, and analysis of data from a sample of a population. • Inferential statistics: A statistical method that is concerned with drawing inferences/conclusions about a particular population by selecting and measuring a random sample from the population. 17
    • Population: Is the largest collection of entities/values of a random variable for which we have an interest at a particular time. Population could be finite or infinite. We can take the whole number of students in a given class (e.g. 100 students) as a population. • Target population: A collection of items that have something in common for which we wish to draw conclusions at a particular time. • Study Population: The specific population from which data are collected 18
    • Sample: Some part/subset of the population of interest. In the above example, if we randomly select 25 students from the 100, we call the 25 selected students a sample of the class. Generalizability is a two-stage procedure: we want to generalize from the sample to the study population and then from the study population to the target population. 19
    • E.g.: In a study of the prevalence of HIV among orphan children in Ethiopia, a random sample of orphan children in Lideta Kifle Ketema was included. Target population: All orphan children in Ethiopia. Study population: All orphan children in Addis Ababa. Sample: Orphan children in Lideta Kifle Ketema. 20
    • Statistical inference: The procedure by which we reach a conclusion about a population on the basis of the information contained in a sample that has been drawn from that population. • Parameter: A descriptive measure computed from the data of a population, i.e. a numerical expression of population measurements, e.g. population mean (µ), population variance, population standard deviation. • Statistic: A descriptive measure computed from the data of a sample. • Statistical data: Information that is systematically collected, tabulated and analysed, and whose results are interpreted to draw conclusions. 21
    • Data: Aggregate of variables as a result of measurement or counting. • Variable: A characteristic that takes on different values in different persons, places, or things. – Dependent variable (response): the variable(s) we measure as an outcome of interest – Independent variable (predictor): the variable(s) that determine the outcome 22
    • Categorical variable: The notion of magnitude is absent or implicit. – Nominal: has distinct levels that have no inherent ordering. – With only two categories, it is called binary or dichotomous. E.g. sex: male or female – With more than two categories, it is called polytomous, e.g. color – Ordinal: has levels that follow a distinct ordering. E.g. severity of pain (mild, moderate, severe) 23
    • Quantitative (numeric) variable: A variable that has magnitude. • Discrete data: the numbers represent actual measurable quantities rather than mere labels, but are restricted to taking only specified values, often integers or counts, that differ by fixed amounts. E.g. number of new AIDS cases reported during a one-year period, number of beds available in a particular hospital • Continuous data: represent measurable quantities but are not restricted to taking on certain specific values, i.e. fractional values are possible. Can use an interval scale (no true zero value) or a ratio scale (begins at a true zero) – e.g. weight, cholesterol level, time, temperature 24
    • 1.3. Sampling Methods • Sampling: The process of selecting a portion of the population to represent the entire population. • A main concern in sampling: – Ensure that the sample represents the population, and – that the findings can be generalized. 25
    • Advantages of sampling:• Feasibility: Sampling may be the only feasible method of collecting information.• Reduced cost: Sampling reduces demands on resource such as finance, personnel, and material.• Greater accuracy: Sampling may lead to better accuracy of collecting data• Sampling error: Precise allowance can be made for sampling error• Greater speed: Data can be collected and summarized more quickly 26
    • Disadvantages of sampling: • There is always a sampling error. • Sampling may create a feeling of discrimination within the population. • Sampling may be inadvisable where every unit in the population is legally required to have a record. • Errors in sampling: 1) Sampling error: Errors introduced due to selection of a sample. – They cannot be avoided or totally eliminated. 2) Non-sampling error: – Observational error – Respondent error – Lack of preciseness of definition – Errors in editing and tabulation of data 27
    • Divisions of Sampling Methods • Two broad divisions: A. Probability sampling methods B. Non-probability sampling methods 28
    • 1.4.1. Probability sampling • Involves random selection of a sample. • A sample is obtained in a way that ensures every member of the population has a known, non-zero probability of being included in the sample. • Involves the selection of a sample from a population, based on chance. 29
    • • Probability sampling is: – more complex, – more time-consuming and – usually more costly than non-probability sampling.• However, because study samples are randomly selected and their probability of inclusion can be calculated, – reliable estimates can be produced and • inferences can be made about the population. 30
    • • There are several different ways in which a probability sample can be selected.• The method chosen depends on a number of factors, such as – the available sampling frame, – how spread out the population is, – how costly it is to survey members of the population 31
    • Most common probability sampling methods 1. Simple random sampling 2. Systematic random sampling 3. Stratified random sampling 4. Cluster sampling 5. Multi-stage sampling 32
    • 1. Simple random sampling (SRS) • Involves random selection • Each member of a population has an equal chance of being included in the sample. • To use the SRS method: – Make a numbered list of all the units in the population – Each unit should be numbered from 1 to N (where N is the size of the population) – Select the required number of units at random. 33
    • The randomness of the sample is ensured by: – use of the “lottery” method – a table of random numbers – a computer program • Example • Suppose your school has 500 students and you need to conduct a short survey on the quality of the food served in the cafeteria. • You decide that a sample of 10 students should be sufficient for your purposes. • In order to get your sample, you assign a number from 1 to 500 to each student in your school. 34
    • To select the sample, you use a table of randomly generated numbers. • Pick a starting point in the table (a row and column number) and look at the random numbers that appear there. In this case, since the data run into three digits, the random numbers would need to contain three digits as well. • Ignore all random numbers greater than 500 because they do not correspond to any of the students in the school. • Remember that sampling is without replacement, so if a number recurs, skip over it and use the next random number. • The first 10 different numbers between 001 and 500 make up your sample. 35
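The cafeteria example above can also be carried out with a computer program instead of a random number table. The following is a minimal Python sketch, not part of the original slides; the function name `simple_random_sample` and the `seed` parameter are illustrative assumptions.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Draw `sample_size` distinct unit numbers from 1..population_size.

    random.Random.sample selects without replacement, so no number
    can appear twice, matching the "skip repeats" rule of the table method.
    """
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return sorted(rng.sample(range(1, population_size + 1), sample_size))


# 10 students out of 500, as in the cafeteria survey example
student_ids = simple_random_sample(500, 10, seed=1)
print(student_ids)
```

Every subset of 10 student numbers is equally likely to be drawn, which is the defining property of SRS.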
    • • SRS has certain limitations: – Requires a sampling frame. – Difficult if the reference population is dispersed. – Minority subgroups of interest may not be selected. 36
    • 2. Systematic random sampling • Sometimes called interval sampling, systematic sampling means that there is a gap, or interval, between each selected unit in the sample. • The selection is systematic rather than random: – Individuals are chosen at regular intervals from the sampling frame. Ideally we randomly select a number to tell us where to start selecting individuals from the list. • Important if the reference population is arranged in some order: – Order of registration of patients – Numerical order of house numbers – Students’ registration books – Taking individuals at fixed intervals (every kth) based on the sampling fraction, e.g. if the sample includes 20%, then every fifth. 37
    • Steps in systematic random sampling 1. Number the units on your frame from 1 to N (where N is the total population size). 2. Determine the sampling interval (K) by dividing the number of units in the population by the desired sample size. 38
    • Steps… In order to find one study unit during a survey, it is important to figure out how many houses must be visited, usually by doing a pilot study. • Example: Assume you are doing a study involving children under 5. There are 1500 households in all, and you have a required sample size of 100 children. From a preliminary study you have done, there is one child in every 2.5 households. Normally, if there were a child in every household, you would visit 100 households. But because not every household includes a child, you will need to visit 100 x 2.5 or 250 households to find the required 100 children. • The sampling interval will therefore be 1500/250 = 6, i.e. every 6th household. 39
    • 3. Select a number between one and K at random. This number is called the random start and is the first unit included in your sample. 4. Select every Kth unit after that first number. Note: Systematic sampling should not be used when a cyclic repetition is inherent in the sampling frame. 40
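The household example above reduces to two lines of arithmetic. A minimal Python sketch follows; the function name `household_sampling_interval` and its parameter names are illustrative and do not come from the slides.

```python
def household_sampling_interval(total_households, required_units, households_per_unit):
    """Return (number of households to visit, sampling interval K).

    households_per_unit is the pilot-study estimate of how many households
    must be visited to find one study unit (2.5 in the example).
    """
    households_to_visit = required_units * households_per_unit
    interval = total_households / households_to_visit
    return households_to_visit, interval


to_visit, k = household_sampling_interval(1500, 100, 2.5)
print(to_visit, k)  # 250.0 6.0
```

In practice K is rounded to a whole number before selecting households; here it is already exactly 6.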
    • Example • To select a sample of 100 from a population of 400, you would need a sampling interval of 400 ÷ 100 = 4. Therefore, K = 4. You will need to select one unit out of every four units to end up with a total of 100 units in your sample. Select a number between 1 and 4 from a table of random numbers. • If you choose 3, the third unit on your frame would be the first unit included in your sample; • the sample would then consist of the following 100 units: 3 (the random start), 7, 11, 15, 19, ..., 395, 399 (up to N, which is 400 in this case). 41
    • The main difference from SRS: with SRS, any combination of 100 units has a chance of making up the sample, while with systematic sampling there are only four possible samples. 42
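The example with N = 400 and n = 100 can be reproduced in a few lines. This is a sketch under the assumption that N is an exact multiple of n (otherwise the realised sample size can differ slightly); the function name `systematic_sample` is illustrative.

```python
import random


def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Select every Kth unit from a frame numbered 1..population_size."""
    k = population_size // sample_size        # sampling interval K
    if start is None:
        # random start between 1 and K inclusive
        start = random.Random(seed).randint(1, k)
    return list(range(start, population_size + 1, k))


sample = systematic_sample(400, 100, start=3)
print(sample[:5], sample[-1], len(sample))  # [3, 7, 11, 15, 19] 399 100

# Only K = 4 distinct samples are possible, one per random start
all_samples = [systematic_sample(400, 100, start=s) for s in range(1, 5)]
print(len(all_samples))  # 4
```

The four possible samples do not overlap and together cover the whole frame, which is why a cyclic pattern with period K in the list can make any one of them unrepresentative.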
    • Advantages • Systematic sampling is usually less time consuming and easier to perform than SRS. • It provides a good approximation to SRS (i.e. has high precision). • Unlike SRS, systematic sampling can be conducted without a sampling frame, so it is useful when a sampling frame is not readily available. – E.g. patients attending a health center, where it is not possible to predict in advance who will be attending 43
    • Disadvantage• If there is any sort of cyclic pattern in the ordering of the subjects, which coincides with the sampling interval, the sample will not be representative of the population. – May result in systematic error 44
    • 3. Stratified random sampling • It is done when the population is known to have heterogeneity with regard to some factors and those factors are used for stratification. • Using stratified sampling, the population is divided into homogeneous, mutually exclusive groups called strata. – A population can be stratified by any variable that is available for all units prior to sampling (e.g., age, sex, province of residence, income, etc.). • A separate sample is taken independently from each stratum. • Any of the sampling methods mentioned in this section (and others that exist) can be used to sample within each stratum. 45
    • Why do we need to create strata? • It can make the sampling strategy more efficient. • A larger sample is required to get a more accurate estimate if a characteristic varies greatly from one unit to the other. • For example, if every person in a population had the same salary, then a sample of one individual would be enough to get a precise estimate of the average salary. • This is the idea behind the efficiency gain obtained with stratification. – If you create strata within which units share similar characteristics (e.g., income) and are considerably different from units in other strata (e.g., occupation, type of dwelling), then you would only need a small sample from each stratum to get a precise estimate of total income for that stratum. 46
    • – Then you could combine these estimates to get a precise estimate of total income for the whole population. • If you use an SRS approach in the whole population without stratification, the sample would need to be larger than the total of all stratum samples to get an estimate with the same level of precision. 47
    • • Stratified sampling ensures an adequate sample size for sub- groups in the population of interest.• When a population is stratified, each stratum becomes an independent population and you will need to decide the sample size for each stratum. 48
    • Equal allocation: – Allocate an equal sample size to each stratum • Proportionate allocation: $n_j = \frac{n}{N} N_j$, $j = 1, 2, \ldots, k$, where $k$ is the number of strata and – $n_j$ is the sample size of the jth stratum – $N_j$ is the population size of the jth stratum – $n = n_1 + n_2 + \cdots + n_k$ is the total sample size – $N = N_1 + N_2 + \cdots + N_k$ is the total population size 49
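Proportionate allocation gives each stratum the same sampling fraction n/N. A minimal Python sketch of the formula follows; the stratum names and sizes are made-up illustration values, and the function name `proportionate_allocation` is an assumption, not something from the slides.

```python
def proportionate_allocation(stratum_sizes, total_sample_size):
    """Return n_j = (n / N) * N_j for every stratum j.

    stratum_sizes maps a stratum label to its population size N_j.
    Results may be fractional and would normally be rounded.
    """
    N = sum(stratum_sizes.values())  # total population size
    return {name: total_sample_size * N_j / N
            for name, N_j in stratum_sizes.items()}


# Hypothetical strata: three year groups with 500, 300 and 200 students
strata = {"year_1": 500, "year_2": 300, "year_3": 200}
allocation = proportionate_allocation(strata, total_sample_size=100)
print(allocation)  # {'year_1': 50.0, 'year_2': 30.0, 'year_3': 20.0}
```

When the n_j are not whole numbers, rounding can make their sum differ from n by a unit or two, so the final sizes are usually adjusted by hand.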
    • 4. Cluster sampling • Sometimes it is too expensive to spread a sample across the population as a whole. • Travel costs can become expensive if interviewers have to survey people from one end of the country to the other. • To reduce costs, researchers may choose a cluster sampling technique. • The clusters should be similar to one another (homogeneous between clusters), unlike stratified sampling, where the strata differ from one another. 50
    • Steps in cluster sampling• Cluster sampling divides the population into groups or clusters.• A number of clusters are selected randomly to represent the total population, and then all units within selected clusters are included in the sample.• No units from non-selected clusters are included in the sample— they are represented by those from selected clusters.• This differs from stratified sampling, where some units are selected from each group. 51
    • Example • In a school-based study, we assume students of the same school are homogeneous. • We can randomly select sections and include all students of the selected sections only. 52
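The school example can be sketched in Python as follows. The section labels and student identifiers are invented for illustration, and `cluster_sample` is a hypothetical helper name; the point is only that clusters are drawn at random and every unit inside a chosen cluster is kept.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick n_clusters clusters and include ALL of their units."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # random clusters
    units = [unit for name in chosen for unit in clusters[name]]
    return chosen, units


# Hypothetical school: 6 sections of unequal size
sections = {
    "A": ["A1", "A2", "A3"],
    "B": ["B1", "B2", "B3", "B4"],
    "C": ["C1", "C2"],
    "D": ["D1", "D2", "D3", "D4", "D5"],
    "E": ["E1", "E2", "E3"],
    "F": ["F1", "F2", "F3", "F4"],
}
chosen_sections, students = cluster_sample(sections, n_clusters=2, seed=0)
print(chosen_sections, len(students))
```

Because sections differ in size, the number of students in the final sample depends on which sections happen to be drawn.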
    • • As mentioned, cost reduction is a reason for using cluster sampling.• It creates pockets of sampled units instead of spreading the sample over the whole territory.• Another reason is that sometimes a list of all units in the population is not available, while a list of all clusters is either available or easy to create. 53
    • • In most cases, the main drawback is a loss of efficiency when compared with SRS.• It is usually better to survey a large number of small clusters instead of a small number of large clusters. – This is because neighboring units tend to be more alike, resulting in a sample that does not represent the whole spectrum of opinions or situations present in the overall population. 54
    • Another drawback to cluster sampling is that you do not have total control over the final sample size. • For example, not all schools have the same number of (say, Grade 11) students, and city blocks do not all have the same number of households. Since you must interview every student or household in the selected clusters, the final sample size may be larger or smaller than you expected. 55
    • 5. Multi-stage sampling• Similar to the cluster sampling, except that it involves picking a sample from within each chosen cluster, rather than including all units in the cluster.• This type of sampling requires at least two stages. 56
    • • In the first stage, large groups or clusters are identified and selected. These clusters contain more population units than are needed for the final sample.• In the second stage, population units are picked from within the selected clusters (using any of the possible probability sampling methods) for a final sample. 57
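The two stages described above can be sketched as below. This is an illustrative Python example with invented school and student identifiers; `two_stage_sample` is a hypothetical name, and simple random sampling is assumed at both stages although any probability method could be used in the second stage.

```python
import random


def two_stage_sample(clusters, n_clusters, units_per_cluster, seed=None):
    """Stage 1: pick clusters at random. Stage 2: SRS within each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)      # first stage
    sample = {}
    for name in chosen:
        units = clusters[name]
        size = min(units_per_cluster, len(units))          # guard small clusters
        sample[name] = rng.sample(units, size)             # second stage
    return sample


# Hypothetical frame: 10 schools with 40 students each
schools = {f"school_{i}": [f"s{i}_{j}" for j in range(1, 41)]
           for i in range(1, 11)}
selected = two_stage_sample(schools, n_clusters=3, units_per_cluster=5, seed=0)
for school, pupils in selected.items():
    print(school, pupils)
```

Only the lists of students for the three selected schools are needed, not a list of every student in all ten schools.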
    • If more than two stages are used, the process of choosing population units within clusters continues until there is a final sample. • With multi-stage sampling, you still have the benefit of a more concentrated sample for cost reduction. • However, the sample is not as concentrated as in cluster sampling, and the required sample size is still larger than for a simple random sample. 58
    • • Also, you do not need to have a list of all of the units in the population. All you need is a list of clusters and list of the units in the selected clusters.• Admittedly, more information is needed in this type of sample than what is required in cluster sampling. However, multi-stage sampling still saves a great amount of time and effort by not having to create a list of all the units in a population. 59
    • 1.4.2. Non-probability sampling • The difference between probability and non-probability sampling has to do with a basic assumption about the nature of the population under study. • In probability sampling, every item has a known chance of being selected. • In non-probability sampling, there is an assumption that there is an even distribution of a characteristic of interest within the population. 60
    • This is what makes the researcher believe that any sample would be representative and that, because of that, results will be accurate. • For probability sampling, randomness is a feature of the selection process, rather than an assumption about the structure of the population. 61
    • • In non-probability sampling, since elements are chosen arbitrarily, there is no way to estimate the probability of any one element being included in the sample.• Also, no assurance is given that each item has a chance of being included, making it impossible either to estimate sampling variability or to identify possible bias 62
    • • Reliability cannot be measured in non-probability sampling; the only way to address data quality is to compare some of the survey results with available information about the population.• Still, there is no assurance that the estimates will meet an acceptable level of error.• Researchers are reluctant to use these methods because there is no way to measure the precision of the resulting sample. 63
    • Despite these drawbacks, non-probability sampling methods can be useful when descriptive comments about the sample itself are desired. • Secondly, they are quick, inexpensive and convenient. • There are also other circumstances in research when it is unfeasible or impractical to conduct probability sampling. 64
    • Common types of non-probability sampling: 1. Convenience or haphazard sampling 2. Volunteer sampling 3. Judgment sampling 4. Quota sampling 5. Snowball sampling 65
    • 1.4. Scales of measurement • Measurement: the assignment of numbers or names to objects or events according to a set of rules. • Clearly not all measurements are the same. • Measuring an individual's weight is qualitatively different from measuring their response to some treatment on a three-category scale: “improved”, “stable”, “not improved”. • Measuring scales differ according to the degree of precision involved. • There are four types of scales of measurement. 66
    • Scales… 1. Nominal scale: uses names, labels, or symbols to assign each measurement to one of a limited number of categories that cannot be ordered. – Examples: blood type, sex, race, marital status 2. Ordinal scale: assigns each measurement to one of a limited number of categories that are ranked in terms of a graded order. – Examples: patient status, cancer stages 67
    • Scales… 3. Interval scale: assigns each measurement to one of an unlimited number of categories that are equally spaced. It has no true zero point. – Example: temperature measured in Celsius or Fahrenheit 4. Ratio scale: measurement begins at a true zero point and the scale has equal spacing. – E.g. height, weight, blood pressure 68
    • Scales… 69
    • 1.5. Validity and reliability • Validity and reliability are two major requirements for any measurement. – Validity pertains to the correctness of the measure; a valid tool measures what it is supposed to measure. – Reliability pertains to the consistency of the tool across different contexts. • Validity is often described as internal or external. 70
    • 1.6. Sources and methods of data collection and its handling • Sources: two major sources • Primary sources are data collected by the investigator himself/herself for the purpose of a specific inquiry or study. Such data are original in character and are mostly generated by surveys conducted by individuals or research institutions. • First-hand information obtained by the investigator is more reliable and accurate, since the investigator can extract the correct information by removing any doubts in the minds of the respondents regarding certain questions. • High response rates may be obtained, since the answers to the questions are obtained on the spot. • It permits explanation of questions concerning difficult subject matter. 71
    • Secondary data: When an investigator uses data that have already been collected by others, such data are called “secondary data”. Such data are primary data for the agency that collected them, and become secondary for someone else who uses them for his/her own purposes. • Secondary data can be obtained from journals, reports of different institutions, government publications, and publications of professionals and research organizations. These data are less expensive and can be collected in a short time. 72
    • Data collection methods • 1. Observation • A technique that involves systematically selecting, watching and recording the behaviour of people or other phenomena, and aspects of the setting in which they occur, for the purpose of getting specified information. • Includes all methods, from simple visual observation to the use of high-level machines, measurements, and sophisticated equipment or facilities, such as X-ray machines, microscopes, and radiographic, biochemical, clinical and microbiological examinations. 73
    • Observation… • Advantages: Gives relatively more accurate data on behaviour and activities. • Disadvantages: – The investigator's or observer's own biases, prejudices, desires, etc. may influence the data. – Needs more resources and skilled human power when high-level machines are used. 74
    • 2. Documentary sources • Include clinical records and other personal records, published mortality statistics, census publications, etc. • Advantages: a) Documents can provide ready-made information relatively easily. b) They are the best means of studying past events. • Disadvantages: a) Problems of reliability and validity (because the information is collected by a number of different persons who may have used different definitions or methods of obtaining data). b) There is a possibility that errors may occur when the information is extracted from the records. 75
    • 3. Interviews and self-administered questionnaires • a) Interviews: may be less or more structured. A public health worker conducting interviews may be armed with a checklist of topics, but may not decide in advance precisely what questions he/she will ask. • This approach is flexible; the content, wording and order of the questions are relatively unstructured and vary from interview to interview. 76
    • Interviews… • In other situations a more standardized technique may be used, with the wording and order of the questions decided in advance. This may take the form of a highly structured interview (interviewing using a questionnaire): • the investigator appoints persons/enumerators, who go to the respondents personally with the questionnaire, ask them the questions and record their replies. – This can be done through telephone or face-to-face interviews. 77
    • Interviews…• Questions may take two general forms: they may be “open ended” questions, which the subject answers in his/her own words,• or “closed” questions, which are answered by choosing from a number of fixed alternative responses. 78
    • Advantages of interviews • A good interviewer can stimulate and maintain the respondent’s interest. This leads to the frank answering of questions. • If anxiety is aroused (e.g., why am I being asked these questions?), the interviewer can allay it. • An interviewer: – can repeat questions which are not understood, and give standardized explanations where necessary. – can ask “follow-up” or “probing” questions to clarify a response. – can make observations during the interview, i.e. note is taken not only of what the subject says but also of how he/she says it. 79
    • b. self-administered questionnaire• The respondent reads the questions and fills in the answers by himself/herself (sometimes in the presence of an interviewer who “stands by” to give assistance if necessary).• The use of self-administered questionnaires is simpler and cheaper;• can be administered to many persons simultaneously (e.g. to a class of school children).• They can be sent by post. However, they demand a certain level of education on the part of the respondent. 80
    • Quantitative data are commonly collected using structured interviews (where standard questionnaires are common and the collected data can be processed relatively easily), whereas • qualitative data are usually collected using unstructured interviews. • Unstructured interviews are undertaken with the help of checklists, key informant interviews, focus group discussions, etc. 81
    • Qualitative… • Checklist – a list of questions prepared ahead of time to facilitate the interviews or discussions. It is not exhaustive. It helps the facilitator not to miss any of the important topics under consideration. • Key informant interviews – interviews done with influential individuals (such as community elders, priests, etc.). • Focus group discussions – discussions held with a group of respondents. • The group contains 6 to 12 people who are more or less similar with respect to level of education, marital status, age, sex, etc. (this composition helps each respondent to talk freely without being dominated by the others). 82
    • Steps in Questionnaire Design 1. Before beginning to construct the questionnaire, make sure that a questionnaire is the best method of collecting data for your objectives. – Know beforehand what information is needed and what is going to be done with it. 2. While drafting the questions, one has to know why each question is asked and what will be done with the information (to prevent wasting resources). 83
    • Steps in… 3. To get valid and reliable information: • the wording and sequence of questions should help respondents recall information and guard against forgetfulness • avoid difficult, time-consuming, embarrassing or too personal questions • the flow of questions should be from simple to complex, from general to specific, and from impersonal to personal • the confidentiality of the respondent should be protected • include a cover letter (if by mail) • identify respondents by ID (rather than by name) 84
    • Data Collection and handling Process 85
    • Data collection A plan for data collection can be made in two steps: 1. Listing the tasks that have to be carried out and who should be involved, making a rough estimate of the time needed for the different parts of the study, and identifying the most appropriate period in which to carry out the research. 2. Actually scheduling the different activities that have to be carried out each week in a work plan. 86
    • Why should you develop a plan for data collection? A plan for data collection should be developed so that: – you will have a clear overview of what tasks have to be carried out, who should perform them, and the duration of these tasks; – you can organize both human and material resources for data collection in the most efficient way; and – you can minimize errors and delays which may result from lack of planning (for example, the population not being available or data forms being misplaced). 87
    • Data collection process: Stages• Stage 1: Permission to proceed – Obtaining consent from the relevant authorities, individuals and the community in which the project is to be carried out 88
    • Data collection process Stage 2: Data collection• Logistics – who will collect what, – when and – with what resources• Quality control – Prepare a field work manual – Select your research assistants – Train research assistants – Supervise data collection – Check data for completeness and accuracy 89
    • Data collection process• How long will it take to collect the data for each component of the study? – Step 1: Consider the time required to reach the study area; to locate the study units; the number of visits required per study unit and for follow-up of non-respondents – Step 2: Calculate the number of interviews that can be carried out per person per day – Step 3: Calculate the number of days needed to carry out the interviews. 90
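Steps 2 and 3 come down to simple arithmetic. The following is a minimal sketch only: the figures (400 interviews, 5 interviewers, 8 interviews per interviewer per day) are hypothetical and do not come from any study in these slides, and the function name `days_needed` is illustrative.

```python
import math


def days_needed(total_interviews, interviewers, interviews_per_person_per_day):
    """Number of field days, rounded up to whole days."""
    interviews_per_day = interviewers * interviews_per_person_per_day
    return math.ceil(total_interviews / interviews_per_day)


# Hypothetical planning figures, for illustration only.
field_days = days_needed(total_interviews=400, interviewers=5,
                         interviews_per_person_per_day=8)
print("Field days needed:", field_days)
```

Travel time and repeat visits to non-respondents (Step 1) would lower the realistic number of interviews per person per day, so that input should be estimated conservatively.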
    • Ensuring data quality Measures to help ensure good quality of data:• Prepare a field work manual for the research team as a whole• Select your research assistants, if required, with care• Train research assistants carefully in all topics covered in the field work manual as well as in interview techniques• Pre-test research instruments and research procedures with the whole research team, including research assistants. 91
    • Ensuring data quality…• Take care that research assistants are not placed under too much stress• Arrange for on-going supervision of research assistants, and develop guidelines for supervisory tasks• Devise methods to assure the quality of data collected by all members of the research team. 92
    • Data Collection Process Stage 3: Data handling• Once the data have been collected and checked for completeness and accuracy, a clear procedure should be developed for handling and storing them• Number all questionnaires• Identify the person responsible for storing data and the place where it will be stored• Decide how data should be stored. Record forms should be kept in the sequence in which they have been numbered. 93
    • Research Assistants• This includes – data collectors, supervisors and may be local guides• Selection – during selection one should consider similarities in educational level and may be sex composition• Training – all research assistants and team members should be trained together 94
    • Pre-test and pilot study A pre-test usually refers to a small-scale trial of particular research components. A pilot study is the process of carrying out a preliminary study, going through the entire research procedure with a small sample. Why do we carry out a pre-test or pilot study? A pre-test or pilot study serves as a trial run that allows us to identify potential problems in the proposed study. 95
    • Pre-test and pilot study What aspects of your research methodology can be evaluated during pre-testing? 1. Reactions of the respondents to the research procedures can be observed in the pre-test – availability and willingness. 2. The data-collection tools can be pre-tested. 3. Sampling procedures can be checked. 4. Staffing and activities of the research team can be checked, while all are involved in the pre-test. 5. Procedures for data processing and analysis can be evaluated during the pre-test. 6. The proposed work plan and budget for research activities can be assessed during the pre-test. 96
    • Plan for data processing & analysis• Data processing and analysis should start in the field, with checking for completeness of the data and• Performing quality control checks, while sorting the data by instrument used and by group of informants• Data of small samples may even be processed and analyzed as soon as it is collected. 97
    • Plan for data processing & analysis• The plan for data processing and analysis must be made after careful consideration of the objectives of the study as well as of the tools developed to meet the objectives.• The procedures for the analysis of data collected through qualitative and quantitative techniques are quite different. – For quantitative data the starting point in analysis is usually a description of the data for each variable – For qualitative data it is more a matter of describing, summarizing and interpreting the data obtained for each study unit 98
    • Plan for data processing & analysis• When making a plan for data processing and analysis the following issues should be considered: – Sorting data, –  Performing quality-control checks, –  Data processing, and –  Data analysis. 99
    • Data processing and analysis• Sorting data – Into groups of different study populations or comparison groups• Quality control checks – Check again for completeness and internal consistency – Missing data - if many exclude the questionnaire – Inconsistency - correct, return or exclude 100
    • Data processing• Decide whether to process and analyse the data from questionnaires: – manually, using data master sheets or manual compilation of the questionnaires, or – by computer, for example, using a micro-computer and existing software or self-written programmes for data analysis.• Data processing in both cases involves: • categorising the data, • coding, and • summarising the data in data master sheets, manual compilation without master sheets, or • data entry and verification by computer. 101
    • 2.Descriptive statistics (Data summarization) 102
    • 2. Data summarization (Descriptive statistics) 2.1. Describing variables The methods of describing variables differ depending on the type of data: categorical or numerical. Sometimes we transform numeric data into categorical data (e.g., age) when a lesser degree of detail is required.• This is achieved by dividing the range of values which the numeric variable takes into intervals. 103
    • Describing…Categorical variables• Table of frequency distributions – Frequency – Relative frequency – Cumulative frequencies• Charts – Bar charts – Pie charts 104
    • Describing … 105
    • In summary,• There are three ways we can summarize and present data:• Tabular representation - summarizing data by making a table of the data called frequency distributions.• Graphical representation of data - we can make a graph of the data.• Numerical representation of data - we can use a single number to represent many numbers. – Measures of central tendency. – Measures of variability. 106
    • 2.2. Frequency Distribution• A frequency distribution shows the number of observations falling into each of several ranges of values.• There are four different types of frequency distributions: – Simple frequency distribution (or just a frequency distribution). – Cumulative frequency distribution. – Grouped frequency distribution. – Cumulative grouped frequency distribution.• They are portrayed as frequency tables, histograms, or polygons.• They can show either the actual number of observations falling in each range or the percentage of observations. In the latter instance, the distribution is called a relative frequency distribution. 107
    • Simple frequency distribution Consider the following set of data, which are the high temperatures recorded for 30 consecutive days. We wish to summarize this data by creating a frequency distribution of the temperatures.
      Data Set - High Temperatures for 30 Days
      50 45 49 50 43 49 50 49 45 49
      47 47 44 51 51 44 47 46 50 44
      51 49 43 43 49 45 46 45 51 46
      108
    • Simple frequency distribution… To create a frequency distribution from this data, proceed as follows: 1. Identify the highest and lowest values in the data set. For our temperatures the highest temperature is 51 and the lowest temperature is 43. 2. Create a column with the title of the variable we are using, in this case temperature. Enter the highest score at the top, and include all values within the range from the highest score to the lowest score. 109
    • Simple frequency… 3. Create a tally column to keep track of the scores as you enter them into the frequency distribution. Once the frequency distribution is completed you can omit this column. 4. Create a frequency column and record the frequency of each value, as shown in the tally column. 5. At the bottom of the frequency column record the total frequency for the distribution, preceded by N =. 6. Enter the name of the frequency distribution at the top of the table. 110
    • Simple frequency… If we applied these steps to the temperature data above, we would have the following frequency distribution:
      Frequency Distribution for High Temperatures
      Temperature   Tally     Frequency
      51            ////      4
      50            ////      4
      49            //// /    6
      48                      0
      47            ///       3
      46            ///       3
      45            ////      4
      44            ///       3
      43            ///       3
                              N = 30
      111
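The tallying steps above can be sketched in a few lines of Python. This is an illustrative sketch rather than part of the original slides; the function name `simple_frequency` is an assumption, and the data are the 30 high temperatures listed earlier.

```python
from collections import Counter

# The 30 high temperatures from the data set above.
temps = [50, 45, 49, 50, 43, 49, 50, 49, 45, 49,
         47, 47, 44, 51, 51, 44, 47, 46, 50, 44,
         51, 49, 43, 43, 49, 45, 46, 45, 51, 46]


def simple_frequency(values):
    """Return (value, frequency) rows from the highest to the lowest value.

    Every integer in the range is listed, so values that never occur
    (such as 48 here) appear with a frequency of 0.
    """
    counts = Counter(values)
    return [(v, counts[v]) for v in range(max(values), min(values) - 1, -1)]


freq_table = simple_frequency(temps)
for value, freq in freq_table:
    print(f"{value:>3}  {'/' * freq:<8} {freq}")
print("N =", sum(freq for _, freq in freq_table))
```

`Counter` returns 0 for a missing key, which is what keeps the empty row for 48 in the table.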
    • Cumulative frequency distribution To create a cumulative frequency distribution:• Create a frequency distribution• Add a column entitled cumulative frequency• The cumulative frequency for each score is the sum of the frequencies of all scores up to and including that score• The highest cumulative frequency should equal N (the total of the frequency column) 112
    • Cumulative frequency…
      Cumulative Frequency Distribution for High Temperatures
      Temperature   Tally     Frequency   Cumulative Frequency
      51            ////      4           30
      50            ////      4           26
      49            //// /    6           22
      48                      0           16
      47            ///       3           16
      46            ///       3           13
      45            ////      4           10
      44            ///       3           6
      43            ///       3           3
                              N = 30
      113
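The cumulative column is a running total of the frequency column, starting from the lowest value. A minimal sketch using the frequencies from the table above (the variable names are illustrative):

```python
from itertools import accumulate

# Temperatures in ascending order with their frequencies from the table.
values = [43, 44, 45, 46, 47, 48, 49, 50, 51]
freqs = [3, 3, 4, 3, 3, 0, 6, 4, 4]

# Running total: each entry is the number of observations at or below that value.
cum_freqs = list(accumulate(freqs))

for value, freq, cum in zip(values, freqs, cum_freqs):
    print(f"{value:>3} {freq:>3} {cum:>3}")
```

The last running total equals N, which is a quick check that no frequency was missed.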
    • Grouped frequency distribution To create a grouped frequency distribution:• select an interval size so that you have 7-20 class intervals (the number of class intervals can also be chosen using Sturges' rule)• create a class interval column and list each of the class intervals• each interval must be the same size, they must not overlap, and there may be no gaps within the range of class intervals• create a tally column (optional)• create a midpoint column for interval midpoints• create a frequency column• enter N = some value at the bottom of the frequency column 114
    • Grouped frequency for the temperature data
      Grouped Frequency Distribution for High Temperatures
      Class Interval   Tally          Interval Midpoint   Frequency
      57-59            //////         58                  6
      54-56            ///////        55                  7
      51-53            ///////////    52                  11
      48-50            /////////      49                  9
      45-47            ///////        46                  7
      42-44            //////         43                  6
      39-41            ////           40                  4
                                                          N = 50
      115
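Sturges' rule suggests the number of class intervals as k = 1 + 3.322 log10(n). The sketch below is illustrative only: the function names are assumptions, and the lowest and highest values (39 and 59) are read from the class limits of the table above because the raw 50 observations are not shown.

```python
import math


def sturges_classes(n):
    """Number of class intervals suggested by Sturges' rule, rounded up."""
    return math.ceil(1 + 3.322 * math.log10(n))


def class_width(lowest, highest, k):
    """Class width: the range divided by the number of classes, rounded up."""
    return math.ceil((highest - lowest) / k)


k = sturges_classes(50)          # N = 50 in the grouped table
width = class_width(39, 59, k)   # limits assumed from the table's class intervals
print("classes:", k, "width:", width)
```

Under these assumptions the rule gives 7 classes of width 3, which agrees with the seven intervals used in the table.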
    • Cumulative grouped frequency distribution We just add a cumulative frequency column to the grouped frequency distribution and we have a cumulative grouped frequency distribution, as shown below.
      Cumulative Grouped Frequency Distribution for High Temperatures
      Class Interval   Tally          Interval Midpoint   Frequency   Cumulative Frequency
      57-59            //////         58                  6           50
      54-56            ///////        55                  7           44
      51-53            ///////////    52                  11          37
      48-50            /////////      49                  9           26
      45-47            ///////        46                  7           17
      42-44            //////         43                  6           10
      39-41            ////           40                  4           4
                                                          N = 50
      116
    • Relative Frequency• Sometimes it is useful to compute the proportion, or percentage, of observations in each category.• The relative frequency of a particular category is the proportion (fraction) of observations that fall into that category.• The cumulative relative frequency is the sum of the relative frequencies of all categories up to and including a particular category. – It is the relative frequency of items less than or equal to the upper class limit of each class.• Cumulative frequencies can be computed for quantitative data and for categorical (qualitative) data (but only if the latter are ordinal). 117
    • Characteristics and guidelines of table construction Characteristics• A table must be self-explanatory• The title should describe the content of the table and should answer the questions what, where and when the data were collected• Percentages in each category should add up to 100• Footnotes should be placed at the bottom of the table 118
    • Guidelines• The shape and size of the table should provide the required number of rows and columns to accommodate the whole data• If a quantity is zero, it should be entered as zero; leaving a blank space or putting a dash in place of zero is confusing and undesirable• If two or more figures are the same, ditto marks should not be used in place of the original numerals• If any figure in a table has to be singled out for a particular purpose, it should be marked with an asterisk 119
    • 2.3. Diagrammatic Representation 2.3.1. Importance of diagrammatic representation: 1. Diagrams have greater attraction than mere figures. They give delight to the eye, add a spark of interest and as such catch the attention as much as the figures dispel it. 2. They help in deriving the required information in less time and without any mental strain. 3. They have greater memorizing value than mere figures, because the impression left by a diagram is of a lasting nature. 4. They facilitate comparison. 120
    • Importance… Well-designed graphs can be an incredibly powerful means of communicating a great deal of information visually. Poorly designed graphs not only fail to convey your message effectively; they often mislead and confuse. 121
    • 2.3.2. Types 1. Bar graph• The bar diagram is the easiest and most adaptable general-purpose chart.• Though this type of chart can be used for any type of series, it is especially satisfactory for nominal and ordinal data.• The categories are represented on the base line (X-axis) at regular intervals and the corresponding frequencies or relative frequencies are represented on the Y-axis (ordinate) in the case of a vertical bar diagram, and vice versa in the case of a horizontal bar diagram. 122
    • Method of constructing a bar graph• All bars drawn in any single study should be of the same width• The different bars should be separated by equal distances• All the bars should rest on the same line, called the base• It is better to construct a diagram on graph paper Types of bar graph• 1. Simple bar graph: a one-dimensional diagram in which the bar represents the whole of the magnitude. The height/length of each bar indicates the frequency of the category represented. Example: Construct a bar graph for the following data. 123
    • Table__, Distribution of pediatric patients in X hospital ward by type of admitting diagnosis, Jan, 2000
      Diagnosis          Number of patients   Relative freq (%)
      Pneumonia          487                  48.7
      Malaria            200                  20.0
      Cardiac problems   168                  16.8
      Malnutrition       80                   8.0
      Others             65                   6.5
      Total              1000                 100
      124
    • 1. Simple bar graph…. 125
    • 2.Sub-divided (component) bar graph • It is also called segmented bar graph. If a given magnitude can be split up into subdivisions, or if there are different quantities forming the subdivisions of the totals, simple bars may be subdivided in the ratio of the various subdivisions to exhibit the relationship of the parts to the whole.• The order in which the components are shown in a "bar" is followed in all bars used in the diagram. 126
    • 2.Sub-divided… 127
    • 3. Multiple bar graph Multiple bar diagrams can be used to represent the relationships among more than two variables. The following figure shows the relationship between children's reports of breathlessness and cigarette smoking by themselves and their parents. 128
    • 3. Multiple bar graph… 129
    • 3. Multiple bar graph…• We can quickly see from the graph that the prevalence of the symptom increases both with the child's smoking and with that of their parents. 130
    • 2. Pie chart A pie chart shows the relative frequency for each category by dividing a circle into sectors, the angles of which are proportional to the relative frequency. Steps to construct a pie chart:• Construct a frequency table• Change the frequencies into percentages (P)• Change the percentages into degrees, where: degrees = (P/100) × 360°• Draw a circle and divide it accordingly 131
    • 2. Pie chart… Example: Distribution of death for females, in England and Wales, 1989.
      Cause of death            Number of deaths
      Circulatory system (C)    100,000
      Neoplasm (N)              70,000
      Respiratory system (R)    30,000
      Injury & poisoning (I)    6,000
      Digestive system (D)      10,000
      Others (O)                20,000
      Total                     236,000
      132
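The conversion from counts to percentages and then to sector angles can be sketched as follows. This is a minimal illustration using the counts from the table above; the dictionary and variable names are assumptions.

```python
# Number of deaths by cause, from the table above.
deaths = {
    "Circulatory system": 100_000,
    "Neoplasm": 70_000,
    "Respiratory system": 30_000,
    "Injury & poisoning": 6_000,
    "Digestive system": 10_000,
    "Others": 20_000,
}

total = sum(deaths.values())

# Percentage of the total, then the sector angle: (P / 100) * 360 degrees.
percentages = {cause: 100 * count / total for cause, count in deaths.items()}
angles = {cause: pct / 100 * 360 for cause, pct in percentages.items()}

for cause in deaths:
    print(f"{cause:<20} {percentages[cause]:5.1f}%  {angles[cause]:6.1f} degrees")
```

The percentages should sum to 100 and the angles to 360, which is a useful check before drawing the circle.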
    • 2. Pie chart… 133
    • 3. Histogram Histograms are frequency distributions with continuous class intervals that have been turned into graphs. To construct a histogram, we draw the interval boundaries on a horizontal line and the frequencies on a vertical line. Non-overlapping intervals that cover all of the data values must be used. Bars are then drawn over the intervals in such a way that the areas of the bars are all proportional in the same way to their interval frequencies. 134
    • Example: Distribution of the RBC cholinesterase values (µmol/min/ml) obtained from 35 workers exposed to pesticides
      RBC cholinesterase (µmol/min/ml)   Frequency, n (%)   Cumulative frequency (%)
      5.95-7.95                          1 (2.9)            2.9
      7.95-9.95                          8 (22.9)           25.8
      9.95-11.95                         14 (40)            65.8
      11.95-13.95                        9 (25.7)           91.5
      13.95-15.95                        2 (5.7)            97.2
      15.95-17.95                        1 (2.9)            100
      Total                              35 (100)
      Source: Knapp RG, Miller MC III: Clinical Epidemiology and Biostatistics
      135
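Relative and cumulative relative frequencies can be computed directly from the counts. The sketch below uses the frequencies from the table above; the variable names are illustrative.

```python
from itertools import accumulate

intervals = ["5.95-7.95", "7.95-9.95", "9.95-11.95",
             "11.95-13.95", "13.95-15.95", "15.95-17.95"]
counts = [1, 8, 14, 9, 2, 1]
n = sum(counts)

# Relative frequency (%) of each class.
rel_pct = [round(100 * c / n, 1) for c in counts]

# Cumulative relative frequency (%), computed from the cumulative counts.
cum_pct = [round(100 * c / n, 1) for c in accumulate(counts)]

for interval, c, r, cum in zip(intervals, counts, rel_pct, cum_pct):
    print(f"{interval:<12} {c:>3} {r:>6} {cum:>6}")
```

Computed this way, some cumulative percentages differ by 0.1 from the table (for example 25.7 rather than 25.8). The table's column appears to add the already-rounded percentages, whereas the sketch rounds only once, after accumulating the counts.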
    • 3. Histogram… [Figure: Histogram of the RBC cholinesterase values of 35 pesticide-exposed workers. X-axis: RBC cholinesterase (µmol/min/ml), class midpoints 6.95 to 16.95. Y-axis: number of pesticide-exposed workers, 0 to 16.] 136
    • 4. Frequency polygon A frequency distribution can be portrayed graphically in yet another way by means of a frequency polygon.• To draw a frequency polygon we connect the mid-points of the tops of the cells of the histogram by straight lines.• It can also be drawn without erecting rectangles, as follows:• The scale should be marked in the numerical values of the mid-points of the intervals.• Erect ordinates on the mid-point of each interval, the length or altitude of an ordinate representing the frequency of the class on whose mid-point it is erected.• Join the tops of the ordinates and extend the connecting line to the scale of sizes. 137
    • 4.Frequency polygon… 138
    • 5. Cumulative frequency polygon (ogive curve) Sometimes it may become necessary to know the number of items whose values are more or less than a certain amount.• We may, for example, be interested in knowing the number of patients whose weight is less than 50 Kg or more than, say, 60 Kg.• To get this information it is necessary to change the form of the frequency distribution from a 'simple' to a 'cumulative' distribution.• An ogive curve turns a cumulative frequency distribution into a graph. 139
    • 5. Cumulative frequency polygon (ogive curve)… Example: Heart rate of patients admitted to Hospital B, 2000
      Heart rate (beats/min)   No. of patients   Cumulative freq., less than method   Cumulative freq., greater than method
      54.5-59.5                1                 1                                    54
      59.5-64.5                5                 6                                    53
      64.5-69.5                3                 9                                    48
      69.5-74.5                5                 14                                   45
      74.5-79.5                11                25                                   40
      79.5-84.5                16                41                                   29
      84.5-89.5                5                 46                                   13
      89.5-94.5                5                 51                                   8
      94.5-99.5                2                 53                                   3
      99.5-104.5               1                 54                                   1
      Total                    54
      140
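Both cumulative columns follow from the frequency column: the "less than" column is a running total from the first class, and the "greater than" column is a running total from the last class. A minimal sketch with the heart-rate frequencies above (variable names are illustrative):

```python
from itertools import accumulate

# Number of patients per heart-rate class, lowest class first.
freqs = [1, 5, 3, 5, 11, 16, 5, 5, 2, 1]

# "Less than" method: patients at or below the upper boundary of each class.
less_than = list(accumulate(freqs))

# "Greater than" method: patients at or above the lower boundary of each class.
greater_than = list(accumulate(reversed(freqs)))[::-1]

for f, lt, gt in zip(freqs, less_than, greater_than):
    print(f"{f:>3} {lt:>3} {gt:>3}")
```

Plotting `less_than` against the upper class boundaries, or `greater_than` against the lower boundaries, gives the two ogive curves.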
    • 5.Cumulative frequency polygon (ogive curve) … 141
    • 6. Box-and-whisker plot It is another way to display information when the objective is to illustrate certain locations in the distribution. A box is drawn with the top of the box at the third quartile and the bottom at the first quartile. The location of the midpoint of the distribution is indicated with a horizontal line in the box. Finally, straight lines or whiskers are drawn from the center of the top of the box to the largest observation and from the center of the bottom of the box to the smallest observation. It is useful when one of the characteristics is qualitative and the other is quantitative. 142
    • E.g.: percentage supersaturation of bile by sex of patients
      Men                                  Women
      Subject   Age   % Supersaturation    Subject   Age   % Supersaturation
      1         23    40                   1         40    65
      2         31    86                   2         33    86
      3         58    11                   3         49    76
      4         25    86                   4         44    89
      5         63    106                  5         63    142
      6         43    66                   6         27    58
      7         67    123                  7         23    98
      8         48    90                   8         56    146
      9         29    112                  9         41    80
      10        26    52                   10        30    66
      11        64    88                   11        38    52
      12        55    137                  12        23    35
      13        31    88                   13        35    55
      14        20    80                   14        50    127
      15        23    65                   15        47    77
      16        43    79                   16        36    91
      17        27    87                   17        74    128
      18        63    56                   18        53    75
      19        59    110                  19        41    82
      20        53    106                  20        25    89
      21        66    110                  21        57    84
      22        48    78                   22        42    116
      23        27    80                   23        49    73
      24        32    47                   24        60    87
      25        62    74                   25        23    76
      26        36    58                   26        48    107
      27        29    88                   27        44    84
      28        27    73                   28        37    120
      29        65    118                  29        57    123
      30        42    67
      31        60    57
      143
    • Box-and-whisker plot… 144
    • Box-and-whisker plot• The graphs indicate the similarity of the distribution between the percentage saturation of bile in men and women.•Again, we see that percentage saturation of bile is a bit more spread out among women with range 35 to 146 but we see also that the mid-points of the distributions are almost the same and that most of the spread in values in women occurs in the upper half of the distribution. 145
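The five values a box-and-whisker plot displays can be computed as below. This is a sketch under stated assumptions: the 29 values are transcribed from the women's column of the table above, and the quartiles use Python's `statistics.quantiles` with its default "exclusive" method, which places quartiles at (n+1)-based positions with linear interpolation. Other software may use a different quartile convention and give slightly different box edges.

```python
import statistics

# Percentage supersaturation of bile for the 29 women, in table order.
women = [65, 86, 76, 89, 142, 58, 98, 146, 80, 66,
         52, 35, 55, 127, 77, 91, 128, 75, 82, 89,
         84, 116, 73, 87, 76, 107, 84, 120, 123]

# Quartiles at the (n + 1)/4, 2(n + 1)/4 and 3(n + 1)/4 positions.
q1, q2, q3 = statistics.quantiles(women, n=4)

summary = {
    "min": min(women),   # end of the lower whisker
    "Q1": q1,            # bottom of the box
    "median": q2,        # line inside the box
    "Q3": q3,            # top of the box
    "max": max(women),   # end of the upper whisker
}
print(summary)
```

The minimum and maximum reproduce the range of 35 to 146 quoted for women, and the median sits closer to the minimum than to the maximum, in line with the remark that most of the spread lies in the upper half.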
    • 7. Scatter plot Most studies in medicine involve measuring more than one characteristic, and graphs displaying the relationship between two characteristics are common in the literature.• To illustrate the relationship between two characteristics when both are quantitative variables we use bivariate plots (also called scatter plots or scatter diagrams).• A scatter diagram is constructed by drawing X- and Y-axes.• Each observation is represented by a point or dot (•).• In the same study on percentage saturation of bile, information was collected on the age of each patient to see whether a relationship existed between the two measures. The following plot displays the result. 146
    • 7. Scatter plot… The graph suggests the possibility of a positive relationship between age and percentage saturation of bile in women. 147
    • 8. Line graph In this type of graph, we have two variables under consideration, as in a scatter diagram.• One variable is taken along the X-axis and the other along the Y-axis.• The points are plotted and joined by line segments in order.• These graphs depict the trend or variability occurring in the data.• Sometimes two or more graphs are drawn on the same graph paper using the same scale so that the plotted graphs are comparable. Example: The following graph shows the level of zidovudine (AZT) in the blood of AIDS patients at several times after administration of the drug, with normal fat absorption and with fat malabsorption. 148
    • Response to administration of zidovudine in two groups of AIDS patients in hospital X, 1999. 149
    • Data Summarization (Numeric Summary) 150
    • Measures of central tendency On the scale of values of a variable there is a certain stage at which the largest number of items tend to cluster. Since this stage is usually in the centre of the distribution, the tendency of statistical data to concentrate at certain values is called "central tendency". The various methods of determining the actual value at which the data tend to concentrate are called measures of central tendency. 151
    • Measures of central tendency… The most important objective of calculating a measure of central tendency is to determine a single figure which may be used to represent a whole series involving magnitudes of the same variable. In that sense it is an even more compact description of the statistical data than the frequency distribution.• Since a measure of central tendency represents the entire data, it facilitates comparison within one group or between groups of data. 152
    • Measures of central tendency… Characteristics of a good measure of central tendency A measure of central tendency is good or satisfactory if it possesses the following characteristics. 1. It should be based on all the observations. 2. It should not be affected by extreme values. 3. It should be as close to the maximum number of values as possible. 4. It should have a definite value. 5. It should not be subject to complicated and tedious calculations. 6. It should be capable of further algebraic treatment. 7. It should be stable with regard to sampling. 153
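    • Arithmetic mean ($\bar{x}$) The most familiar measure of central tendency (MCT) is the arithmetic mean (AM). It is also popularly known as the average. a) Ungrouped data If $x_1, x_2, \ldots, x_n$ are n observed values, then: $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$ 154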
    • Arithmetic mean… b) Grouped data In calculating the mean from grouped data, we assume that all values falling into a particular class interval are located at the mid-point of the interval. It is calculated as follows: $\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$ where k = the number of class intervals, $m_i$ = the mid-point of the ith class interval, and $f_i$ = the frequency of the ith class interval. 155
    • Arithmetic mean… Example. Mean = 2630/100 = 26.3 156
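The table behind the example above is not reproduced in the transcript, so the sketch below applies the same grouped-mean calculation (sum of midpoint times frequency, divided by the total frequency) to the grouped temperature table from the frequency distribution section. The function name `grouped_mean` is illustrative.

```python
# Interval midpoints and frequencies from the grouped temperature table
# (class intervals 39-41 up to 57-59, N = 50).
midpoints = [40, 43, 46, 49, 52, 55, 58]
freqs = [4, 6, 7, 9, 11, 7, 6]


def grouped_mean(midpoints, freqs):
    """Mean of grouped data: every value is assumed to sit at its class midpoint."""
    total = sum(m * f for m, f in zip(midpoints, freqs))
    return total / sum(freqs)


mean_temp = grouped_mean(midpoints, freqs)
print(round(mean_temp, 2))
```

Here the weighted total is 2486 over 50 observations, in the same form as the 2630/100 of the slide's own example.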
    • Arithmetic mean…• The arithmetic mean possesses the following properties.• Uniqueness: For given set of data there is one and only one arithmetic mean.• Simplicity: The arithmetic mean is easily understood and easy to compute.• Center of gravity: Algebraic sum of the deviations of the given values from their arithmetic mean is always zero.• Sensitivity: The arithmetic mean possesses all the characteristics of a central value, except No.2, (is greatly affected by the extreme values).• In case of grouped data if any class interval is open, arithmetic mean can not be calculated 157
    • The Median ($\tilde{x}$)• a) Ungrouped data• The median of a finite set of values is the value which divides the set into two equal parts, such that the number of values greater than the median is equal to the number of values less than the median.• If the number of values is odd, the median is the middle value when all values have been arranged in order of magnitude.• When the number of observations is even, there is no single middle observation but two middle observations.• In this case the median is taken to be the mean of these two middle observations, when all observations have been arranged in order of magnitude. 158
    • The Median… b) Grouped data• In calculating the median from grouped data, we assume that the values within a class interval are evenly distributed through the interval.• The first step is to locate the class interval in which the median lies. We use the following procedure.• Find n/2 and look for the class interval with the smallest cumulative frequency that contains n/2.• To find a unique median value, use the following interpolation formula. 159
    • Median… $\text{Median} = L_m + \left(\frac{n/2 - F_c}{f_m}\right) W$ where $L_m$ = lower true class boundary of the interval containing the median, $F_c$ = cumulative frequency of the interval immediately preceding the median class interval, $f_m$ = frequency of the interval containing the median, W = class interval width, and n = total number of observations. 160
    • Median… Example: n/2 = 75/2 = 37.5. Median class interval = 35-44. Lm = 34.5, Fc = 35, W = 10, n = 75, fm = 22.• Median = 34.5 + ((37.5 - 35)/22) × 10 = 35.64 161
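The interpolation in the example can be checked with a short function. This is a minimal sketch: the parameter names mirror the symbols of the example (Lm, Fc, fm, W, n), and the function name `grouped_median` is an assumption.

```python
def grouped_median(lower_boundary, cum_freq_before, class_freq, width, n):
    """Median of grouped data by linear interpolation within the median class.

    lower_boundary  -- Lm, lower true boundary of the median class
    cum_freq_before -- Fc, cumulative frequency before the median class
    class_freq      -- fm, frequency of the median class
    width           -- W, class interval width
    n               -- total number of observations
    """
    return lower_boundary + (n / 2 - cum_freq_before) / class_freq * width


median = grouped_median(lower_boundary=34.5, cum_freq_before=35,
                        class_freq=22, width=10, n=75)
print(round(median, 2))
```

The result agrees with the 35.64 obtained in the worked example.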
    • Properties of the median• There is only one median for a given set of data• The median is easy to calculate• Median is a positional average and hence it is not drastically affected by extreme values• Median can be calculated even in the case of open end intervals• It is not a good representative of data if the number of items is small 162
    • Mode ($\hat{x}$) a) Ungrouped data• It is the value which occurs most frequently in a set of values.• If all the values are different there is no mode; on the other hand, a set of values may have more than one mode. b) Grouped data• In designating the mode of grouped data, we usually refer to the modal class, where the modal class is the class interval with the highest frequency.• If a single value for the mode of grouped data must be specified, it is taken as the mid-point of the modal class interval. 163
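For ungrouped data, the mode rules above (no mode when all values differ, possibly several modes otherwise) can be sketched as follows. The function name `modes` is illustrative, and the first data set is the 30 high temperatures from the frequency distribution section.

```python
from collections import Counter


def modes(values):
    """Return the list of modes of a non-empty data set.

    An empty list means there is no mode, i.e. every value occurs only once.
    """
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1 and len(counts) > 1:
        return []  # all values are different: no mode
    return sorted(v for v, c in counts.items() if c == highest)


temps = [50, 45, 49, 50, 43, 49, 50, 49, 45, 49,
         47, 47, 44, 51, 51, 44, 47, 46, 50, 44,
         51, 49, 43, 43, 49, 45, 46, 45, 51, 46]

print(modes(temps))            # single mode
print(modes([1, 2, 3]))        # no mode
print(modes([1, 1, 2, 2, 3]))  # two modes
```

For the temperature data the single mode is 49, the value with the highest frequency (6) in the frequency table.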
    • Properties of mode• It is not affected by extreme values• It can be calculated for distributions with open end classes• Often its value is not unique• The main drawback of mode is that often it does not exist 164
    • MEASURES OF POSITION Quartiles• Quartiles divide the distribution into four equal parts. The 25th percentile demarcates the first quartile (Q1),• the median or 50th percentile demarcates the second quartile (Q2),• the 75th percentile demarcates the third quartile (Q3),• and the 100th percentile demarcates the fourth quartile (Q4), which is the maximum observation.• Q1 is the ¼(n+1)th measurement, i.e., 25% of all the ranked observations are less than Q1.• Q2 is the 2/4(n+1)th = ((n+1)/2)th measurement, i.e., 50% of all ranked observations are less than Q2. Its position is twice that of Q1.• Q3 is the ¾(n+1)th observation, i.e., 75% of all the ranked observations are less than Q3. Its position is three times that of Q1. 165
    • Percentile• Percentiles divide the data into 100 equal parts.• The maximum is the value in a set of data that has 100% of the observations at or below it. When we consider it in this way, we call it the 100th percentile.• From this same perspective, the median, which has 50% of the observations at or below it, is the 50th percentile.• The pth percentile of a distribution is the value such that p percent of the observations are less than or equal to it. The pth percentile value depends on whether np/100 is an integer or not:• If np/100 is not an integer, it is the (k+1)th observation in ascending order, where k is the largest integer less than np/100.• If np/100 is an integer, it is the average of the (np/100)th and (np/100 + 1)th observations in ascending order. 166
    • Percentiles… Example: The following data are a sample of birth weights (grams) of live births at a hospital during a one-week period: 3265, 3248, 2838, 3323, 3245, 3101, 2581, 3200, 4146, 2759, 3609, 2069, 3260, 3314, 3541, 3649, 3484, 2834, 2841, 3031. Calculate the 10th and 90th percentiles. Solution: n = 20; p = 10 and 90. First put the data in ascending order: 2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245, 3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146. For the 10th percentile, np/100 = 20 × 10/100 = 2, which is an integer. So the 10th percentile is the average of the 2nd and 3rd ordered observations: (2581 + 2759)/2 = 2670 grams. For the 90th percentile, np/100 = 20 × 90/100 = 18, which is an integer. So the 90th percentile is the average of the 18th and 19th ordered observations: (3609 + 3649)/2 = 3629 grams. 167
    • Percentiles…• Therefore, we would say that 80 percent of the birth weights fall between 2670 g and 3629 g, which gives us an overall feel for the spread of the distribution.• The most commonly used percentiles other than the median (50th percentile) are the 25th percentile and the 75th percentile. 168
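The percentile rule used in this example (take the next ordered observation when np/100 is not an integer, average two neighbouring observations when it is) can be sketched as below. The function name `percentile` is illustrative, and p is assumed to be a whole number strictly between 0 and 100.

```python
def percentile(values, p):
    """p-th percentile by the np/100 rule; p is a whole number with 0 < p < 100."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    k, remainder = divmod(n * p, 100)
    k = int(k)
    if remainder == 0:
        # np/100 is an integer: average the k-th and (k+1)-th ordered values
        # (1-based), which are ordered[k - 1] and ordered[k].
        return (ordered[k - 1] + ordered[k]) / 2
    # np/100 is not an integer: take the (k+1)-th ordered value (1-based).
    return ordered[k]


birth_weights = [3265, 3248, 2838, 3323, 3245, 3101, 2581, 3200, 4146, 2759,
                 3609, 2069, 3260, 3314, 3541, 3649, 3484, 2834, 2841, 3031]

p10 = percentile(birth_weights, 10)
p90 = percentile(birth_weights, 90)
print(p10, p90)
```

This reproduces the 2670 g and 3629 g of the worked example. Statistical packages often use other interpolation conventions, so their percentiles for the same data can differ slightly.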
    • Measures of variability • A measure of central tendency alone is not enough to give a clear idea about the distribution of the data. • Moreover, two or more data sets may have the same mean and/or median and still be quite different. • Thus, to have a clear picture of the data, one needs a measure of dispersion or variability (scatteredness) among the observations in the set. 169
    • Range (R) R = XL - XS, where XL is the largest value and XS is the smallest value. • Properties • It is the simplest measure and can be easily understood • It takes into account only two values, which makes it a poor measure of dispersion 170
    • Interquartile range (IQR) IQR = Q3 - Q1, where Q3 is the third quartile and Q1 is the first quartile. Example: Suppose the first and third quartiles for weights of girls 12 months of age are 8.8 kg and 10.2 kg respectively. The interquartile range is therefore IQR = 10.2 kg - 8.8 kg = 1.4 kg, i.e., the middle 50% of infant girls at 12 months weigh between 8.8 and 10.2 kg. 171
    • Interquartile range … 172
    • Interquartile… • Generally, we use the interquartile range to describe variability when we use the median as the measure of central location. We use the standard deviation, which is described in the next section, when we use the mean. Properties • It is a simple and versatile measure • It encloses the central 50% of the observations • It is not based on all observations but only on two specific values • It is important in selecting cut-off points in the formulation of clinical standards • Since it excludes the lowest and highest 25% of values, it is not affected by extreme values • It is not capable of further algebraic treatment 173
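As a rough illustration, the quartiles and IQR of the birth-weight sample from the percentile example can be computed with Python's `statistics.quantiles`. Its "exclusive" method places the quartiles at the ¼(n+1)th, 2/4(n+1)th and ¾(n+1)th ranked positions and interpolates linearly between neighbouring observations. The interpolation step is an assumption here, since the slides do not say how fractional positions are handled.

```python
from statistics import quantiles

# Birth weights (grams) from the percentile example
birth_weights = [3265, 3248, 2838, 3323, 3245, 3101, 2581, 3200, 4146, 2759,
                 3609, 2069, 3260, 3314, 3541, 3649, 3484, 2834, 2841, 3031]

# n=4 asks for the three cut points Q1, Q2 (median) and Q3;
# method="exclusive" uses the (n+1)-based rank positions
q1, q2, q3 = quantiles(birth_weights, n=4, method="exclusive")
iqr = q3 - q1  # width of the interval holding the middle 50% of observations

print(q1, q2, q3)  # 2838.75 3246.5 3443.75
print(iqr)         # 605.0
```

Here n = 20, so Q1 sits at position 5.25 (a quarter of the way from the 5th to the 6th ranked value) and Q3 at position 15.75.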
    • Quartile deviation (QD) QD = (Q3 - Q1)/2, i.e., half of the interquartile range. Coefficient of quartile deviation (CQD) CQD = (Q3 - Q1)/(Q3 + Q1). CQD is a pure number (unitless) and is useful for comparing the variability among the middle 50% of observations. 174
    • Mean deviation (MD) • Mean deviation is the average of the absolute deviations taken from a central value, generally the mean or median. • Consider a set of n observations x1, x2, ..., xn. Then MD = (1/n) Σ |xi - A|, summed over i = 1, ..., n, where A is a central value (arithmetic mean or median). 175
    • Mean deviation … Properties • MD overcomes one main objection to the earlier measures, because it involves every value • It is not affected much by extreme values • Its main drawback is that the algebraic signs of the deviations are ignored, which is mathematically unsound • MD is minimum when the deviations are taken from the median. 176
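A small sketch of the mean deviation, taken once about the mean and once about the median. The five data values are hypothetical, chosen only so the arithmetic is easy to follow, and the helper name `mean_deviation` is made up for this illustration.

```python
from statistics import fmean, median


def mean_deviation(values, centre):
    """Average absolute deviation of the values from a chosen central value."""
    return sum(abs(x - centre) for x in values) / len(values)


data = [2, 4, 6, 8, 30]  # hypothetical observations with one extreme value

md_about_mean = mean_deviation(data, fmean(data))     # centre = 10.0
md_about_median = mean_deviation(data, median(data))  # centre = 6

print(md_about_mean)    # 8.0
print(md_about_median)  # 6.4
```

The deviation about the median (6.4) is smaller than the deviation about the mean (8.0), in line with the property that MD is minimum when taken from the median.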
    • The Variance (σ², S²) • The main objection to the mean deviation, that the negative signs are ignored, is removed by taking the square of the deviations from the mean. • The variance is the average of the squares of the deviations taken from the mean. 177
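    • Variance… a) Ungrouped data Let X1, X2, ..., XN be the measurements on N population units; then the population variance is σ² = Σ (Xi - μ)² / N, where μ = Σ Xi / N is the population mean. 178
    • Variance… The sample variance of the set x1, x2, ..., xn of n observations is S² = Σ (xi - x̄)² / (n - 1), where x̄ is the sample mean. 179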
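    • Variance… b) Grouped data For a frequency distribution with class midpoints mi and class frequencies fi, S² = Σ fi (mi - x̄)² / (Σ fi - 1), where x̄ = Σ fi mi / Σ fi. 180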
    • Variance… Properties • The main demerit of the variance is that its unit is the square of the unit of measurement of the variate values • The variance gives more weight to extreme values than to those near the mean, because the differences are squared. • The first drawback is overcome by the standard deviation, which is expressed in the original units. 181
    • Standard deviation (σ, S) It is the positive square root of the variance. Properties • Standard deviation is widely regarded as the best measure of dispersion and is widely used because of the properties of the theoretical normal curve. • There is, however, one difficulty with it: if the units of measurement of the variables in two series are not the same, their variability cannot be compared by comparing the values of the standard deviations. 182
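The population variance (divide by N), the sample variance (divide by n - 1) and the standard deviation can be sketched as below. The data values are hypothetical and the function names are chosen for this illustration only.

```python
from math import sqrt


def population_variance(values):
    """Average squared deviation from the mean, dividing by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)


data = [4, 8, 6, 5, 7]  # hypothetical measurements; mean = 6

sigma2 = population_variance(data)  # squared deviations 4+4+0+1+1 = 10, 10/5
s2 = sample_variance(data)          # 10/4
s = sqrt(s2)                        # standard deviation, in the original units

print(sigma2)       # 2.0
print(s2)           # 2.5
print(round(s, 4))  # 1.5811
```

Python's `statistics.pvariance`, `statistics.variance` and `statistics.stdev` give the same quantities without writing the sums by hand.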
    • Coefficient of variation • When we want to compare the variability in two sets of data, the standard deviation, which measures absolute variation, may lead to false results. • The coefficient of variation gives relative variation and is the preferred measure for comparing the variability in two sets of data. Do not use the SD to compare variability between groups measured in different units or with very different means. • CV = standard deviation / mean (often multiplied by 100 and expressed as a percentage) 183
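A short sketch of why the CV is used for comparisons across units. The means and standard deviations below are hypothetical, and expressing the CV as a percentage is a common convention assumed here.

```python
def coefficient_of_variation(sd, mean):
    """Relative variation: standard deviation as a percentage of the mean."""
    return 100 * sd / mean


# Hypothetical summary statistics for two variables in different units
cv_weight = coefficient_of_variation(sd=6.0, mean=60.0)   # weight in kg
cv_height = coefficient_of_variation(sd=8.5, mean=170.0)  # height in cm

print(cv_weight)  # 10.0
print(cv_height)  # 5.0
```

Although the height SD (8.5 cm) is numerically larger than the weight SD (6 kg), the weights are relatively more variable (10% versus 5%).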
    • 4.Basic Probability and probability distributions• Probability is a mathematical technique for predicting outcomes. It predicts how likely it is that specific events will occur.• An understanding of probability is fundamental for quantifying the uncertainty that is inherent in the decision-making process• Probability theory also allows us to draw conclusions about a population of patients based on known information about a sample of patients drawn from that population. 184
    • Basic Probability…• Mutually exclusive events: Events that cannot occur together – For example, event A=“Male” and B=“Pregnant” are two mutually exclusive events (as no males can be pregnant).• Independent events: The presence or absence of one does not alter the chance of the other being present. – one event happens regardless of the other, and its outcome is not related to the other.• Probability: If an event can occur in N mutually exclusive and equally likely ways, and if m of these possess a characteristic E, the probability of the occurrence of E is P(E) = m/N. 185
    • 4.1. Properties of probability 1. A probability value must lie between 0 and 1: 0 ≤ P(E) ≤ 1. A probability can never be more than 1.0, nor can it be negative. • A value of 0 means the event cannot occur • A value of 1 means the event definitely will occur • A value of 0.5 means that the probability that the event will occur is the same as the probability that it will not occur. • Probability is measured on a scale from 0 to 1.0, as shown in the following figure of the probability scale. 186
    • Properties… [Figure: probability scale from 0 to 1.0] 187
    • Properties… 2. The sum of the probabilities of all mutually exclusive and exhaustive outcomes is equal to 1: P(E1) + P(E2) + .... + P(En) = 1. 3. For any two events A and B, P(A or B) = P(A) + P(B) - P(A and B) (addition rule). For two mutually exclusive events A and B, P(A or B) = P(A) + P(B). 4. For any two independent events A and B, P(A and B) = P(A) P(B) (multiplication rule). 188
    • Properties… • To calculate the probability of event A and event B both happening (independent events): for example, if you have two identical packs of cards (pack A and pack B), what is the probability of drawing the ace of spades from both packs? • Formula: P(A) × P(B). P(pack A) = 1 card from a pack of 52 cards = 1/52 = 0.0192. P(pack B) = 1 card from a pack of 52 cards = 1/52 = 0.0192. P(A) × P(B) = 0.0192 × 0.0192 = 0.00037. 5. If A’ is the complementary event of the event A, then P(A’) = 1 - P(A). 189
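The card example and the complement rule can be checked with exact fractions. This is only an illustrative sketch; the variable names are made up.

```python
from fractions import Fraction

# Probability of drawing the ace of spades from one full pack of 52 cards
p_ace_pack_a = Fraction(1, 52)
p_ace_pack_b = Fraction(1, 52)

# Multiplication rule for independent events: P(A and B) = P(A) * P(B)
p_both = p_ace_pack_a * p_ace_pack_b

# Complement rule: P(A') = 1 - P(A)
p_not_ace_pack_a = 1 - p_ace_pack_a

print(p_both)                   # 1/2704
print(round(float(p_both), 5))  # 0.00037
print(p_not_ace_pack_a)         # 51/52
```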
    • Example • A study investigated the effect of prolonged exposure to bright light on retinal damage in premature infants. Eighteen of 21 premature infants exposed to bright light developed retinopathy, while 21 of 39 premature infants exposed to reduced light levels developed retinopathy. For this sample, the probability of developing retinopathy is: P(Retinopathy) = No. of infants with retinopathy / Total no. of infants = (18 + 21)/(21 + 39) = 39/60 = 0.65 190
    • Example… • The following data are the results of electrocardiograms (ECGs) and radionuclide angiocardiograms (RAs) for 19 patients with post-traumatic myocardial contusions. A “+” indicates an abnormal result and a “-” indicates a normal result. • 1. Calculate the probability that both the ECG and the RA are abnormal • 2. Calculate the probability that either the ECG or the RA is abnormal 191
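    • Example… Summary of the data: of the 19 patients, 17 had an abnormal ECG, 9 had an abnormal RA, and 7 had both abnormal. 192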
    • Example Solutions 1. P(ECG abnormal and RA abnormal) = 7/19 = 0.37 2. P(ECG abnormal or RA abnormal) = P(ECG abnormal) + P(RA abnormal) - P(both ECG and RA abnormal) = 17/19 + 9/19 - 7/19 = 19/19 = 1 • NB: We cannot calculate the above probability by simply adding the number of patients with abnormal ECGs to the number with abnormal RAs, i.e. (17 + 9)/19 = 1.37, which exceeds 1. • The problem is that the 7 patients whose ECGs and RAs are both abnormal are counted twice. 193
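The addition rule for the ECG/RA example can be verified with exact fractions, using the counts given in the solution (17 abnormal ECGs, 9 abnormal RAs, 7 with both, out of 19 patients). The variable names are illustrative only.

```python
from fractions import Fraction

n_patients = 19
p_ecg = Fraction(17, n_patients)   # P(ECG abnormal)
p_ra = Fraction(9, n_patients)     # P(RA abnormal)
p_both = Fraction(7, n_patients)   # P(ECG abnormal and RA abnormal)

# Addition rule: subtract the overlap so it is not counted twice
p_either = p_ecg + p_ra - p_both

# Naive sum without the correction is not a valid probability
naive_sum = p_ecg + p_ra

print(p_either)                    # 1
print(round(float(naive_sum), 2))  # 1.37
```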
    • 4.2. Conditional probability • Conditional probabilities are probabilities based on the knowledge that some event has occurred. • In the retinopathy study described in the above example, the primary concern is the comparison of the bright-light infants with the reduced-light infants. We want to know whether the probability of retinopathy for the bright-light infants differs from the probability of retinopathy for the reduced-light infants. • That is, we want to compare the probability of retinopathy given that the infant was exposed to bright light with the probability of retinopathy given that the infant was exposed to reduced light. – Exposure to bright light and exposure to reduced light are conditioning events, i.e. events we want to take into account when calculating conditional probabilities. 194
    • Conditional… • Conditional probabilities are denoted by P(A/B) (read as “probability of A given B”) or P(Event/Conditioning event). The formula for calculating a sample conditional probability is: P(Event/Conditioning event) = (No. of observations for which the event and the conditioning event both occur) / (No. of observations for which the conditioning event occurs). In general, P(A/B) = P(A∩B) / P(B), if P(B) > 0. 195
    • Conditional… • Example: For the retinopathy data, the conditional probability of retinopathy, given exposure to bright light, is P(Retinopathy/exposure to bright light) = (No. of infants with retinopathy exposed to bright light) / (No. of infants exposed to bright light) = 18/21 = 0.86. Similarly, P(Retinopathy/exposure to reduced light) = (No. of infants with retinopathy exposed to reduced light) / (No. of infants exposed to reduced light) = 21/39 = 0.54. • The conditional probabilities suggest that premature infants exposed to bright light have a higher risk of retinopathy than premature infants exposed to reduced light. 196
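The sample conditional probabilities for the retinopathy data can be reproduced with a few lines of Python. This is a minimal sketch; the function name is hypothetical.

```python
def conditional_probability(n_event_and_condition, n_condition):
    """Sample estimate of P(event / condition) from two counts."""
    if n_condition == 0:
        raise ValueError("the conditioning event never occurred")
    return n_event_and_condition / n_condition


# Counts from the retinopathy study
p_given_bright = conditional_probability(18, 21)   # bright light
p_given_reduced = conditional_probability(21, 39)  # reduced light
p_overall = (18 + 21) / (21 + 39)                  # ignoring exposure

print(round(p_given_bright, 2))   # 0.86
print(round(p_given_reduced, 2))  # 0.54
print(round(p_overall, 2))        # 0.65
```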
    • Summary of formulas for calculating probability • More exercises 197
    • Calculating probability of an event
      Table ---: Frequency of cocaine use by gender among adult cocaine users
      ---------------------------------------------------------------
      Lifetime frequency of cocaine use    Male    Female    Total
      ---------------------------------------------------------------
      1-19 times                             32         7       39
      20-99 times                            18        20       38
      more than 100 times                    25         9       34
      ---------------------------------------------------------------
      Total                                  75        36      111
      ---------------------------------------------------------------
      198
    • Questions 1. What is the probability that a randomly picked person is male? 2. What is the probability that a randomly picked person has used cocaine more than 100 times? 3. Given that the selected person is male, what is the probability that he has used cocaine more than 100 times? 4. Given that the person has used cocaine fewer than 100 times, what is the probability that the person is female? 5. What is the probability that a randomly picked person is male and has used cocaine more than 100 times? 199
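    • Answers 1. Pr(m) = Total adult males / Total adult cocaine users = 75/111 = 0.68. 2. Pr(c>100) = All adults who used cocaine more than 100 times / Total adult cocaine users = 34/111 = 0.31. 3. Pr(c>100 | m) = 25/75 = 0.33. 4. Pr(f | c<100) = (7 + 20)/(39 + 38) = 27/77 = 0.35; the denominator is everyone who used cocaine fewer than 100 times, not the total number of females. 5. Pr(m ∩ c>100) = Pr(m) × Pr(c>100 | m) = 75/111 × 25/75 = 25/111 = 0.23. 200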
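The five probabilities can be computed directly from the table counts, which makes the choice of denominator explicit in each case. The dictionary layout and variable names are illustrative only. Note that for question 4 the conditioning event is "used cocaine fewer than 100 times", so the denominator is the 77 people in the first two rows.

```python
from fractions import Fraction

# Counts from the cocaine-use table: frequency category -> counts by gender
counts = {
    "1-19": {"male": 32, "female": 7},
    "20-99": {"male": 18, "female": 20},
    "100+": {"male": 25, "female": 9},
}

total = sum(sum(row.values()) for row in counts.values())  # 111
males = sum(row["male"] for row in counts.values())        # 75
under_100 = sum(sum(counts[k].values()) for k in ("1-19", "20-99"))  # 77
females_under_100 = counts["1-19"]["female"] + counts["20-99"]["female"]  # 27

p_male = Fraction(males, total)                                  # Q1
p_heavy = Fraction(sum(counts["100+"].values()), total)          # Q2
p_heavy_given_male = Fraction(counts["100+"]["male"], males)     # Q3
p_female_given_under_100 = Fraction(females_under_100, under_100)  # Q4
p_male_and_heavy = p_male * p_heavy_given_male                   # Q5

for p in (p_male, p_heavy, p_heavy_given_male,
          p_female_given_under_100, p_male_and_heavy):
    print(p, round(float(p), 2))
```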
    • 4.3. Normal distribution • If we take a large sample of men or women, measure their heights, and plot them on a frequency distribution, the distribution will usually show an approximately symmetrical, bell-shaped pattern known as the normal distribution (also called the Gaussian distribution). See the following figure. • The least frequently recorded heights lie at the two extremes of the curve. From the figure, it can be seen that very few women are extremely short or extremely tall. 201
    • Normal distribution… Figure 3.4.1. Distribution of a sample of values of women's heights. 202
    • Normal distribution… • In practice, many biological measurements follow this pattern, making it possible to use the normal distribution to describe many features of a population. • It must be emphasized that some measurements do not follow the symmetrical shape of the normal distribution, and can be positively skewed or negatively skewed. • For example, more of the populations of developed Western countries are becoming obese. If a large sample of such a population's weights were plotted on a graph similar to that in Figure 3.4.1 above, there would be an excess of heavier weights, which might form a similar shape to the negatively skewed example in Figure 3.4.2 below. • The distribution will therefore not fit the symmetrical pattern of the normal distribution. 203
    • Normal distribution… Figure 3.4.2. Examples of positive and negative skew. 204
    • Normal distribution… Figure 3.4.3. The normal distribution. 205
    • Normal distribution… From the normal distribution shown in Figure 3.4.3 above, you can see that it is split into two equal and identically shaped halves by the mean. • The standard deviation indicates the size of the spread of the data. It can also help us to determine how likely it is that a given value will be observed in the population being studied. We know this because the proportion of the population that is covered by any number of standard deviations can be calculated. • The proportions of values below and above a specified value can be calculated, and are known as tails. • The normal distribution is useful in a number of applications, including confidence intervals and hypothesis testing. 206
    • Properties of the normal distribution 1. It is symmetrical about its mean, μ. 2. The mean, the median and the mode are all equal. 3. The total area under the curve above the x-axis is one square unit. 4. The curve never touches the x-axis. 5. As the value of σ increases, the curve becomes more and more flat, and vice versa. 6. About 68% of the values of X fall within one standard deviation of the mean, about 95% within two standard deviations of the mean and about 99.7% within three standard deviations of the mean. 7. The distribution is completely determined by the parameters μ and σ. 8. The mean is μ and the variance is σ². 207
    • Standard normal distribution • It is a normal distribution that has a mean equal to 0 and a standard deviation equal to 1. • Z-transformation: If a random variable X ~ N(μ, σ), then we can transform it to a standard normal distribution with the help of the Z-transformation: Z = (X - μ) / σ 208
    • Example 1 • In 1932, scores on the Stanford-Binet IQ test were roughly normally distributed with μ = 100 and σ = 15. • Over time IQs have increased (better nutrition, or more experience taking tests?), so the average IQ for present-day American children taking the 1932 test would be 120, with the same σ. • “Very superior” is an IQ above 130. – (a) What % of 1932 children were “very superior”? – (b) What % of present-day children would be “very superior” on the 1932 test? 210
    • Solution Let X be the 1932 IQ scores and let Y be the scores of present-day children on the 1932 test. X ~ N(100, 15) and Y ~ N(120, 15). (a) P(X > 130) = P(Z > (130 - 100)/15) = P(Z > 2.0) = 0.0228 (from the Z table). 2.28% of 1932 children were “very superior”. (b) P(Y > 130) = P(Z > (130 - 120)/15) = P(Z > 0.67) = 0.2514. 25.14% of present-day children would be “very superior”. 211
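The tail areas in this solution can be checked without a Z table using `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is a sketch; the helper name `upper_tail` is made up for the illustration.

```python
from statistics import NormalDist


def upper_tail(x, mu, sigma):
    """P(X > x) for a normal distribution with mean mu and SD sigma."""
    return 1 - NormalDist(mu, sigma).cdf(x)


p_1932 = upper_tail(130, mu=100, sigma=15)   # z = 2.0
p_today = upper_tail(130, mu=120, sigma=15)  # z = 0.6667 (unrounded)
p_today_table = upper_tail(0.67, mu=0, sigma=1)  # z rounded to 0.67, as in a Z table

print(round(p_1932, 4))         # 0.0228
print(round(p_today, 4))        # 0.2525
print(round(p_today_table, 4))  # 0.2514
```

The small difference in part (b), 0.2525 versus 0.2514, comes only from rounding z to two decimal places before looking it up in the table.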
    • Example 2 • Systolic blood pressure in normal healthy individuals is normally distributed with μ = 120 and σ = 10 mm Hg. 1) What proportion of normal healthy individuals have a systolic blood pressure above 130 mm Hg? 2) What proportion of normal healthy individuals have a systolic blood pressure between 100 and 140 mm Hg? 3) What level of systolic blood pressure cuts off the lower 95% of normal healthy individuals? 212
    • Solutions 213
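One way to work the three blood-pressure questions numerically is sketched below, assuming systolic pressure is N(120, 10) as stated in Example 2 and using `statistics.NormalDist` in place of a Z table. The variable names are illustrative.

```python
from statistics import NormalDist

sbp = NormalDist(mu=120, sigma=10)  # systolic blood pressure, mm Hg

# 1) P(X > 130): z = (130 - 120)/10 = 1
p_above_130 = 1 - sbp.cdf(130)

# 2) P(100 < X < 140): z runs from -2 to +2
p_between = sbp.cdf(140) - sbp.cdf(100)

# 3) Value with 95% of individuals below it: mu + z(0.95) * sigma
cutoff_95 = sbp.inv_cdf(0.95)

print(round(p_above_130, 4))  # 0.1587
print(round(p_between, 4))    # 0.9545
print(round(cutoff_95, 1))    # 136.4
```

With table values, question 3 is usually worked as 120 + 1.645 × 10 = 136.45 mm Hg.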
    • 4.4. The Binomial distribution It is one of the most widely encountered discrete distributions. • The origin of the binomial distribution lies in Bernoulli trials. When a single trial of some experiment can result in only one of two mutually exclusive outcomes (success or failure; dead or alive; sick or well; male or female), the trial is called a Bernoulli trial. • Suppose an event can have only binary outcomes A and B. Let the probability of A be π and that of B be 1 - π. The probability π stays the same each time the event occurs. • If the experiment is repeated n times and the outcomes are independent from one trial to another, the probability that outcome A occurs exactly x times is: 216
    • The Binomial distribution… P(X = x) = [n! / (x! (n - x)!)] π^x (1 - π)^(n - x), for x = 0, 1, ..., n. 217
    • Characteristics of a Binomial Distribution • The experiment consists of n identical trials. • There are only two possible outcomes on each trial. • The probability of A remains the same from trial to trial. This probability is denoted by π (also written p), and the probability of B is denoted by q. Note that q = 1 - π. • The trials are independent. • The binomial random variable X is the number of A’s in n trials. • n and π are the parameters of the binomial distribution. • The mean is nπ and the variance is nπ(1 - π). 218
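The binomial probability P(X = x) = C(n, x) π^x (1 - π)^(n - x), where C(n, x) = n!/(x!(n - x)!), can be sketched with `math.comb`. The values n = 4 and π = 0.5 are hypothetical, chosen so the results are easy to verify by hand, and the function name `binomial_pmf` is made up for this illustration.

```python
from math import comb


def binomial_pmf(x, n, pi):
    """P(X = x): probability of exactly x occurrences of A in n independent trials."""
    return comb(n, x) * pi ** x * (1 - pi) ** (n - x)


n, pi = 4, 0.5  # hypothetical: 4 trials, probability 0.5 of outcome A on each

pmf = [binomial_pmf(x, n, pi) for x in range(n + 1)]
mean = sum(x * p for x, p in enumerate(pmf))
variance = sum((x - mean) ** 2 * p for x, p in enumerate(pmf))

print(pmf)       # [0.0625, 0.25, 0.375, 0.25, 0.0625]
print(sum(pmf))  # 1.0
print(mean)      # 2.0, equal to n * pi
print(variance)  # 1.0, equal to n * pi * (1 - pi)

# "At most 1" and "at least 3" are sums of individual terms
p_at_most_1 = sum(pmf[:2])
p_at_least_3 = sum(pmf[3:])
```

Cumulative questions such as "at most" or "at least" are answered by adding the relevant individual probabilities, as in the last two lines.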
    • Exercise (Homework) • Each child born to a particular set of parents has a probability of 0.25 of having blood type O. If these parents have 5 children, what is the probability that: a. Exactly two of them have blood type O b. At most 2 have blood type O c. At least 4 have blood type O d. 2 do not have blood type O 219