Human resources section2b-textbook_on_public_health_and_community_medicine


AFMC WHO Textbook Community Medicine

  1. 1. 38 Introduction to Biostatistics

Seema R. Patrikar

The origin of statistics roots from the Greek word ‘Statis’, which means state. In the early days the administration of the state required the collection of information regarding the population for the purpose of war. Around 2000 years ago, India had such a system of collecting administrative statistics. In the Mauryan regime a system of registration of vital events (births and deaths) existed. Ain-i-Akbari is a collection of information gathered in various surveys conducted during the reign of Emperor Akbar.

The birth of statistics occurred in the mid-17th century. A commoner named John Graunt began reviewing a weekly publication issued by the local parish clerk that listed the number of births, christenings and deaths in each parish. These so-called Bills of Mortality also listed the causes of death. Graunt, who was a shopkeeper, organized this data, which was published as Natural and Political Observations made upon the Bills of Mortality. The seventeenth century contribution of the theory of probability laid the foundation of modern statistical methods.

Statistics has become increasingly important with passing time. Statistical methods are fruitfully applied to any problem of decision making where past information is available or can be made available. They help to weigh the evidence and draw conclusions. Statistics finds application in almost all fields of science; we hardly find any science that does not make use of it.

Definition of Statistics

Different authors have defined statistics differently. The best definition is given by Croxton and Cowden, according to whom statistics may be defined as the science which deals with the collection, presentation, analysis and interpretation of numerical data.

Definition of Biostatistics

Biostatistics may be defined as the application of statistical methods to medical, biological and public health related problems. It is the scientific treatment given to medical data derived from a group of individuals or patients.

Role of Statistics in Clinical Medicine

The main theory of statistics lies in the term variability. No two individuals are the same. For example, the blood pressure of a person may vary from time to time as well as from person to person. We can also have instrumental variability as well as observer variability. Methods of statistical inference provide largely objective means for drawing conclusions from the data about the issue under study. Medical science is full of uncertainties, and statistics deals with uncertainties. Statistical methods try to quantify the uncertainties present in medical science and help the researcher arrive at a scientific judgment about a hypothesis. It has been argued that decision making is an integral part of a physician’s work. Frequently, decision making is probability based.

Role of Statistics in Public Health and Community Medicine

Statistics finds extensive use in Public Health and Community Medicine. Statistical methods are foundations for public health administrators to understand what is happening to the population under their care, at the community level as well as the individual level. If reliable information regarding the disease is available, the public health administrator is in a position to:
● Assess community needs
● Understand socio-economic determinants of health
● Plan experiments in health research
● Analyse their results
● Study diagnosis and prognosis of the disease for taking effective action
● Scientifically test the efficacy of new medicines and methods

Statistics in public health is critical for calling attention to problems, identifying risk factors, suggesting solutions, and ultimately for taking credit for our successes. The most important application of statistics in sociology is in the field of demography.

Statistics helps in developing sound methods of collecting data so as to draw valid inferences regarding the hypothesis. It helps us present the data in numerical form after simplifying complex data by way of classification, tabulation and graphical presentation. Statistics can be used for comparison as well as to study the relationship between two or more factors. Such a relationship further helps to predict one factor from the other. Statistics helps researchers come to valid conclusions in answering their research questions.

Despite the wide importance of the subject, it is looked upon with suspicion. “Lies, damned lies, and statistics” is part of a phrase attributed to Benjamin Disraeli and popularized in the United States by Mark Twain: “There are three kinds of lies: lies, damned lies, and statistics.” The semi-ironic statement refers to the persuasive power of numbers, and describes how even accurate statistics can be used to bolster inaccurate arguments. It is human psychology that when facts are supported by figures, they are easily believed. If wrong figures are used they are bound to give wrong conclusions; hence, when statistical theories are applied, the figures used should be free of all types of biases and should have been properly collected and scientifically analysed.

Broad Categories of Statistics

Statistics can broadly be split into two categories: Descriptive Statistics and Inferential Statistics. Descriptive statistics deals with the meaningful presentation of data such that its characteristics can be effectively observed. It encompasses the tabular, graphical or pictorial display of data, condensation of large data into tables, and preparation of summary measures to give a concise description of complex information and to exhibit patterns that may be found in data sets. Inferential statistics, however, refers to decisions. Medical research doesn’t stop at just describing the characteristics of a disease or situation. It tries to determine whether characteristics of a situation are unusual or whether they have happened by chance. Because of this desire to generalize, the first step is to statistically analyse the
  2. 2. data.

In order to begin our analysis as to why statistics is necessary, we must begin by addressing the nature of science and experimentation. The characteristic method used by a researcher starting an experiment is to study a relatively small collection of subjects, as complete population based studies are time consuming, laborious, costly and resource intensive. The researcher draws a subset of the population, called a “sample”, and studies this sample in depth. But the conclusions drawn after analysing the sample are not restricted to the sample; they are extrapolated to the population, i.e. people in general. Thus Statistics is the mathematical method by which the uncertainty inherent in the scientific method is rigorously quantified.

Summary

In recent times, the use of Statistics as a tool to describe various phenomena is increasing in biological sciences and health related fields, so much so that irrespective of the sphere of investigation, a research worker has to plan his/her experiments in such a manner that the kind of conclusions which he/she intends to draw become technically valid. Statistics comes to his/her aid at the stages of planning of the experiment, collection of data, analysis, and interpretation of measures computed during the analysis. Biostatistics is defined as the application of statistical methods to medical, biological and public health related problems. Statistics is broadly categorized into descriptive statistics and inferential statistics. Descriptive statistics describes the data in meaningful tables or graphs so that the hidden pattern is brought out. Condensing complex data into a simple format and describing it with summary measures is part of descriptive statistics. Inferential statistics, on the other hand, deals with drawing inferences and taking decisions by studying a subset or sample from the population.

Study Exercises

Short Notes : (1) Differentiate between descriptive and inferential statistics (2) Describe briefly various scales of measurement.

MCQs & Exercises

1. An 85 year old man is rushed to the emergency department by ambulance during an episode of chest pain. The preliminary assessment of the condition of the man is performed by a nurse, who reports that the patient’s pain seems to be ‘severe’. The characterization of pain as ‘severe’ is (a) Dichotomous (b) Nominal (c) Quantitative (d) Qualitative
2. If we ask the patient attending OPD to evaluate his pain on a scale of 0 (no pain) to 5 (the worst pain), then this commonly applied scale is a (a) Dichotomous (b) Ratio scale (c) Continuous (d) Nominal
3. For each of the following variables indicate whether it is quantitative or qualitative and specify the measurement scale for each variable : (a) Blood Pressure (mmHg) (b) Cholesterol (mmol/l) (c) Diabetes (Yes/No) (d) Body Mass Index (Kg/m2) (e) Age (years) (f) Sex (female/male) (g) Employment (paid work/retired/housewife) (h) Smoking Status (smokers/non-smokers/ex-smokers) (i) Exercise (hours per week) (j) Drink alcohol (units per week) (k) Level of pain (mild/moderate/severe)

Answers : (1) d; (2) b; (3) (a) Quantitative continuous; (b) Quantitative continuous; (c) Qualitative dichotomous; (d) Quantitative continuous; (e) Quantitative continuous; (f) Qualitative dichotomous; (g) Qualitative nominal; (h) Qualitative nominal; (i) Quantitative discrete; (j) Quantitative discrete; (k) Qualitative ordinal.

39 Descriptive Statistics: Displaying the Data

Seema R. Patrikar

The observations made on the subjects one after the other are called raw data. Raw data are often little more than a jumble of numbers and hence very difficult to handle. Data are collected by researchers so that they can answer the research question they started with. Raw data become useful only when they are arranged and organized in a manner that lets us extract information from them and communicate it to others. In other words, data should be processed and subjected to further analysis. This is possible through data depiction, data summarization and data transformation.

The first step in handling the data, after they have been collected, is to ‘reduce’ and summarise them so that they become understandable; only then can meaningful conclusions be drawn. Data can be displayed in either tabular form or graphical form. Tables are used to categorize and summarize data, while graphs are used to provide an overall visual representation. To develop graphs and diagrams, we first need to condense the data in a table.

Understanding as to how the Data have been Recorded

Before we start summarizing or further analysing the data, we should be very clear on which ‘scale’ they have been recorded (i.e. qualitative or quantitative; and whether continuous, discrete, ordinal, polychotomous or dichotomous). The details have already been covered earlier in the chapter on variables and scales of measurement (section on epidemiology) and the
  3. 3. readers should quickly revise that chapter before proceeding.

Ordered Data

When the data are organized in order of magnitude from the smallest value to the largest value, the result is called an ordered array. For example, consider the ages (in years) of 11 subjects undergoing a tobacco cessation programme: 16, 27, 34, 41, 38, 53, 65, 52, 20, 26, 68. When we arrange these ages in increasing order of magnitude we get the ordered array 16, 20, 26, 27, 34, 38, 41, 52, 53, 65, 68. From the ordered array we can quickly determine that the youngest person is 16 years old and the oldest 68 years. We can also easily state that almost 55% of the subjects (6 of 11) are below 40 years of age, and that the midway person is aged 38 years.

Grouped Data - Frequency Table

Besides arranging the data in an ordered array, grouping of data is yet another useful way of summarizing them. We classify the data in appropriate groups which are called “classes”. The basic purpose behind classification or grouping is to help comparison and also to accommodate a large number of observations into a few classes only, by condensation, so that similarities and dissimilarities can be easily brought out. It also highlights important features and pinpoints the most significant ones at a glance.

Table - 1 shows a set of raw data obtained from a cross-sectional survey of a random sample of 100 children under one year of age for malnutrition status. Information regarding age and sex of the child was also collected. We will use this data to illustrate the construction of various tables. If we show the distribution of children as per age alone, the result is called a simple table, as only one variable is considered.

Table - 1 : Raw data on malnutrition status (malnourished and normal) for 100 children below one year of age

Child  Sex  Age (months)  Malnutrition Status
1      f    6             Normal
2      m    4             Malnourished
3      m    2             Malnourished
4      m    5             Normal
5      m    3             Normal
6      f    1             Normal
7      m    5             Normal
8      f    8             Normal
9      f    7             Normal
10     f    9             Normal
11     f    10            Normal
12     f    2             Normal
13     m    4             Malnourished
14     f    6             Normal
15     m    8             Normal
16     f    1             Malnourished
17     f    2             Normal
18     m    11            Normal
19     m    12            Normal
20     m    11            Malnourished
21     m    10            Normal
22     f    9             Normal
23     f    5             Normal
24     f    6             Normal
25     m    4             Normal
26     f    7             Normal
27     f    11            Normal
28     f    12            Normal
29     m    10            Malnourished
30     m    4             Normal
31     m    6             Normal
32     m    8             Normal
33     m    12            Malnourished
34     m    1             Malnourished
35     m    1             Normal
36     f    3             Normal
37     m    5             Normal
38     f    6             Normal
39     f    8             Normal
40     f    9             Normal
41     f    10            Malnourished
42     m    1             Normal
43     f    12            Malnourished
44     f    2             Malnourished
45     m    1             Normal
46     m    6             Normal
47     m    4             Normal
48     f    9             Normal
49     f    4             Normal
50     m    9             Normal
51     m    7             Normal
52     m    6             Normal
53     m    4             Normal
54     f    2             Normal
55     m    5             Normal
56     m    3             Normal
57     f    1             Normal
58     m    5             Normal
  4. 4. Table - 1 (continued)

Child  Sex  Age (months)  Malnutrition Status
60     m    7             Malnourished
61     m    9             Normal
62     m    10            Normal
63     f    2             Normal
64     f    4             Normal
65     f    6             Normal
66     f    8             Normal
67     m    1             Normal
68     m    2             Normal
69     m    11            Normal
70     f    12            Normal
71     m    11            Normal
72     m    10            Malnourished
73     f    9             Normal
74     f    5             Normal
75     f    6             Normal
76     m    4             Normal
77     m    7             Normal
78     m    11            Normal
79     f    12            Normal
80     f    10            Normal
81     m    4             Normal
82     m    6             Malnourished
83     m    8             Normal
84     m    12            Normal
85     m    1             Normal
86     m    1             Normal
87     m    3             Normal
88     f    5             Normal
89     m    6             Normal
90     f    8             Normal
91     f    9             Normal
92     f    10            Normal
93     f    1             Normal
94     m    12            Normal
95     m    2             Normal
96     f    1             Normal
97     m    6             Normal
98     f    4             Malnourished
99     f    9             Malnourished
100    m    4             Normal

Steps in Making a Summary Table for the Data

To group a set of observations we select a set of contiguous, non-overlapping intervals such that each value in the set of observations can be placed in one and only one of the intervals. These intervals are usually referred to as class intervals. For example, the above data can be grouped into the age groups 1-4, 5-8 and 9-12. The class interval 1-4 includes the values 1, 2, 3 and 4. The smallest value, 1, is called its lower class limit, whereas the highest value, 4, is called its upper class limit. The middle value of 1-4, i.e. 2.5, is called the midpoint or class mark. The number of subjects falling in the class interval 1-4 is called its class frequency. Such a presentation of data in class intervals along with frequencies is called a frequency distribution.

When both limits are included in the range of values of the interval, the class intervals are known as inclusive type of class intervals (e.g. 1-4, 5-8, 9-12, etc.). When the lower boundary is included but the upper limit is excluded from the range of values, the class intervals are known as exclusive type of class intervals (e.g. 1-5, 5-9, 9-13, etc.). The exclusive type is suitable for continuous variables. Tables can be formed for qualitative variables also. Table - 2 and Table - 3 display tabulations for a quantitative and a qualitative variable respectively.

Table - 2 : Age distribution of the 100 children

Age group (months)  Number of children
1-4                 36
5-8                 33
9-12                31
Total               100

Table - 3 : Distribution of malnourishment in 100 children

Malnourishment Status  Number of children
Malnourished           17
Normal                 83
Total                  100

A tabulation which takes only one variable for classification is called a one way table. When two variables are involved the table is referred to as a cross tabulation or two way table. For example, Table - 4 displays the age and sex distribution of the children and Table - 5 displays the distribution of malnourishment status and sex of children.

Table - 4 : Age and sex distribution of 100 children

Age group (months)  Female  Male  Total
1-4                 14      22    36
5-8                 15      18    33
9-12                16      15    31
Total               45      55    100
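The one way and two way tabulations described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' procedure: the names `age_group`, `one_way_table` and `two_way_table` are hypothetical, and the ten records are the first ten children listed in Table - 1.

```python
from collections import Counter

# First ten records of Table - 1 as (sex, age in months).
CHILDREN = [
    ("f", 6), ("m", 4), ("m", 2), ("m", 5), ("m", 3),
    ("f", 1), ("m", 5), ("f", 8), ("f", 7), ("f", 9),
]

# Inclusive class intervals: both limits belong to the interval.
CLASS_INTERVALS = [(1, 4), (5, 8), (9, 12)]


def age_group(age):
    """Return the label of the inclusive class interval containing age."""
    for lower, upper in CLASS_INTERVALS:
        if lower <= age <= upper:
            return f"{lower}-{upper}"
    raise ValueError(f"age {age} is outside all class intervals")


def one_way_table(records):
    """Class frequency of each age group (one variable)."""
    return Counter(age_group(age) for _, age in records)


def two_way_table(records):
    """Cross tabulation of age group by sex (two variables)."""
    return Counter((age_group(age), sex) for sex, age in records)


if __name__ == "__main__":
    ages = one_way_table(CHILDREN)
    cross = two_way_table(CHILDREN)
    print("Age group  Female  Male  Total")
    for lower, upper in CLASS_INTERVALS:
        label = f"{lower}-{upper}"
        print(label, cross[(label, "f")], cross[(label, "m")], ages[label])
```

Because each age falls in exactly one interval, the class frequencies always add up to the number of records, which is a quick check on any hand-made frequency table.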
  5. 5. Table - 5 : Malnourishment status and sex distribution of children

Malnourishment Status  Female  Male  Total
Malnourished           6       11    17
Normal                 39      44    83
Total                  45      55    100

How to Decide on the Number of Class Intervals?

When data are to be grouped, we must decide upon the number of class intervals to be made. Too few class intervals would result in losing information. On the other hand, too many class intervals would not bring out the hidden pattern. The thumb rule is that we should not have fewer than 5 class intervals and no more than 15 class intervals. To be specific, experts have suggested a formula for the approximate number of class intervals (k):

$k = 1 + 3.322 \log_{10} N$, rounded to the nearest integer,

where N is the number of values or observations under consideration. For example, if N = 25 we have $k = 1 + 3.322 \log_{10} 25 \approx 5.6$, i.e. approximately 6 class intervals.

Having decided the number of class intervals, the next step is to decide the width of the class interval. The width of the class interval is taken as:

$\text{Width} = \dfrac{\text{Maximum observed value} - \text{Minimum observed value}}{\text{Number of class intervals } (k)} = \dfrac{\text{Range}}{k}$

The class limits should preferably be rounded figures, and the class intervals should be non-overlapping and must cover the range of the observed data. As far as possible the percentages and totals should be calculated column wise.

Graphical Presentation of Data

The tabular presentation discussed above shows the distribution of subjects in various groups or classes. This tabular representation of the frequency distribution is useful for further analysis and conclusion. But it is difficult for a layman to understand a complex distribution of data in tabular form. Graphical presentation of data is better understood and appreciated. Graphical representation brings out the hidden pattern and trends of complex data sets.

Thus the reason for displaying data graphically is two fold:
1) Investigators can have a better look at the information collected and the distribution of data, and
2) To communicate this information to others quickly.

We shall discuss in detail some of the commonly used graphical presentations.

Bar Charts : Bar charts are used for qualitative variables. The variable studied is plotted in the form of bars along the X-axis (horizontal), and the height of each bar is equal to the percentage or frequency, which is plotted along the Y-axis (vertical). The width of the bars is kept constant for all the categories and the space between the bars also remains constant throughout. The number of subjects, along with percentages in brackets, may be written on the top of each bar. When we draw a bar chart with only one variable or a single group it is called a simple bar chart, and when two variables or two groups are considered it is called a multiple bar chart. In a multiple bar chart the two bars representing two variables are drawn adjacent to each other and equal width of the bars is maintained. The third type of bar chart is the component bar chart, wherein we have two qualitative variables which are further segregated into different categories or components. Here the total height of the bar corresponding to one variable is further sub-divided into the components or categories of the other variable. For example, consider the following data (Table - 6), which show the findings of a hypothetical research work intended to describe the pattern of blood groups among patients of essential hypertension.

Table - 6 : Distribution of blood group of patients of essential hypertension

Blood Group  Number of patients (frequency)  Percentage
A            232                             42.81
B            201                             37.08
AB           76                              14.02
O            33                              6.09
Total        542                             100.00

A simple bar chart in respect of the above data on blood groups among patients of essential hypertension is represented in Fig. - 1. Similarly, a multiple bar chart of the data in Table - 5 on the distribution of malnourishment status among males and females is shown in Fig. - 2. The same information can also be depicted in the form of a component bar chart as in Fig. - 3.

[Fig. - 1 : Distribution of blood groups of patients with essential hypertension. Simple bar chart; X-axis: Blood Groups (A, B, AB, O); Y-axis: Frequency.]
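The rule of thumb for choosing the number of class intervals and their width, given earlier in this section, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the multiplier is taken to be 1 / log10(2) ≈ 3.322, the usual form of this rule (Sturges' rule).

```python
import math


def number_of_class_intervals(n):
    """Approximate number of class intervals, k = 1 + 3.322 * log10(n).

    The constant 3.322 is assumed to be 1 / log10(2) (Sturges' rule);
    the result is rounded to the nearest integer.
    """
    if n < 1:
        raise ValueError("n must be a positive number of observations")
    return round(1 + 3.322 * math.log10(n))


def class_width(minimum, maximum, k):
    """Width = (maximum observed value - minimum observed value) / k."""
    return (maximum - minimum) / k


if __name__ == "__main__":
    # 100 children whose ages range from 1 to 12 months.
    k = number_of_class_intervals(100)
    print("class intervals:", k)
    print("width:", class_width(1, 12, k))
```

In practice the computed width is only a starting point: it is usually rounded to a convenient figure so that the class limits come out as round numbers, as recommended above.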
  6. 6. [Fig. - 2 : Multiple Bar Chart showing the distribution of malnourishment status in males and females. X-axis: Malnourished, Normal; Y-axis: Number; bars: Females 6 and 39, Males 11 and 44.]

[Fig. - 3 : Component Bar Chart showing the distribution of malnourishment status in males and females. X-axis: Female, Male; Y-axis: Number; components: Malnourished, Normal.]

Pie Chart : Another interesting method of displaying categorical (qualitative) data is a pie diagram, also called a circular diagram. A pie diagram is essentially a circle in which the angle at the centre for each category is equal to its proportion multiplied by 360 (or, more easily, its percentage multiplied by 360 and divided by 100). A pie diagram is best when the total number of categories is between 2 and 6. If there are more than 6 categories, try to reduce them by “clubbing”, otherwise the diagram becomes too overcrowded.

A pie diagram in respect of the data on blood groups among patients of essential hypertension is presented below after calculating the angles for the individual categories, as in Fig. - 4 a, b.

[Fig. - 4a : Distribution of patients according to blood group. Pie chart; A : 43 %, B : 37 %, AB : 14 %, O : 6 %.]

Fig. - 4b

Blood group A = $\dfrac{42.81}{100} \times 360 \approx 154$ degrees
Blood group B = $\dfrac{37.08}{100} \times 360 \approx 134$ degrees
Blood group AB = $\dfrac{14.02}{100} \times 360 \approx 50$ degrees
Blood group O = $\dfrac{6.09}{100} \times 360 \approx 22$ degrees

Frequency Curve and Polygon : To construct a frequency curve or a frequency polygon we plot the variable along the X-axis and the frequencies along the Y-axis. Observed values of the variable, or the midpoints of the class intervals, are plotted against the corresponding frequency of that class interval. Then we construct a smooth freehand curve passing through these points. Such a curve is known as a frequency curve. If, instead of joining the midpoints by a smooth curve, we join consecutive points by straight lines, the result is called a frequency polygon. Conventionally, we consider one imaginary value immediately preceding the first value and one succeeding the last value and plot them with frequency = 0. An example is given in Table - 7 and Fig. - 5.

Table - 7 : Distribution of subjects as per age groups

Age    Midpoints  Number of subjects
20-25  22.5       2
25-30  27.5       3
30-35  32.5       6
35-40  37.5       14
40-45  42.5       7
45-50  47.5       5
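The two calculations described on this page, the pie-chart angles and the points of a frequency polygon, can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, the angles are computed directly from the frequencies of the blood group data rather than from rounded percentages, and the polygon helper assumes class intervals of equal width.

```python
def pie_angles(frequencies):
    """Angle at the centre for each category: proportion * 360 degrees."""
    total = sum(frequencies.values())
    return {category: freq / total * 360 for category, freq in frequencies.items()}


def polygon_points(class_intervals, frequencies):
    """(midpoint, frequency) pairs for a frequency polygon.

    One imaginary class interval is added before the first and after the
    last class, each with frequency 0. Equal class widths are assumed.
    """
    width = class_intervals[0][1] - class_intervals[0][0]
    first_lower = class_intervals[0][0]
    last_upper = class_intervals[-1][1]
    padded = (
        [(first_lower - width, first_lower)]
        + list(class_intervals)
        + [(last_upper, last_upper + width)]
    )
    counts = [0] + list(frequencies) + [0]
    return [((lower + upper) / 2, count) for (lower, upper), count in zip(padded, counts)]


if __name__ == "__main__":
    blood_groups = {"A": 232, "B": 201, "AB": 76, "O": 33}
    for group, angle in pie_angles(blood_groups).items():
        print(group, round(angle), "degrees")

    intervals = [(20, 25), (25, 30), (30, 35), (35, 40), (40, 45), (45, 50)]
    subjects = [2, 3, 6, 14, 7, 5]
    print(polygon_points(intervals, subjects))
```

The unrounded angles always sum to 360 degrees, so a total that is off by a degree after rounding points to rounding of the individual sectors rather than to an arithmetic mistake.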
  7. 7. [Fig. - 5 : Distribution of subjects in different age groups. Frequency polygon; X-axis: Age groups; Y-axis: Number of subjects.]

Stem-and-leaf plots : This presentation is used for quantitative data. To construct a stem-and-leaf plot, we divide each value into a stem component and a leaf component. The digit in the tens place becomes the stem and the digit in the units place becomes the leaf. The plot is of much utility in quickly assessing whether the data follow a “normal” distribution, by seeing whether the stem and leaf shows a bell shape. For example, consider a sample of 10 values of age in years : 21, 42, 05, 11, 30, 50, 28, 27, 24, 52. Here, 21 has a stem component of 2 and a leaf component of 1. Similarly the second value, 42, has a stem component of 4 and a leaf component of 2, and so on. The stem values are listed in numerical order (ascending or descending) to form a vertical axis. A vertical line is drawn to outline the stems (Fig. - 6). The value of each leaf is then plotted in its appropriate location on the other side of the vertical line; if the stem value already exists, the leaf is placed to the right of the leaves already recorded for it, as in Fig. - 7.

Fig. - 6    Fig. - 7

0 |         0 | 5
1 |         1 | 1
2 |         2 | 1 4 7 8
3 |         3 | 0
4 |         4 | 2
5 |         5 | 0 2

To describe the central location, spread and shape of the stem plot, we rotate it by 90 degrees to show it more clearly, as in Fig. - 8. Roughly, we can say that the spread of the data is from 5 to 52 and the median value is between 27 and 28. Regarding the shape of the distribution, though it is difficult to make firm statements about shape when n is small, we can always determine (Fig. - 9) :
● Whether the data are more or less symmetrical or are extremely skewed
● Whether there is a central cluster or mound
● Whether there are any outliers

[Fig. - 8 : Stem plot rotated by 90 degrees, marking the rough estimate of the centre or middle observation, i.e. the median value (27.5), and the spread of the data.]

[Fig. - 9]

For the given example we notice the mound (heap) in the middle of the distribution. There are no outliers.

Histogram : The stem-and-leaf plot is a good way to explore distributions. A more traditional approach is to use a histogram. A histogram is used for quantitative continuous data: on the X-axis we plot the exclusive type of class intervals and on the Y-axis we plot the frequencies. The difference between a bar chart and a histogram is that, since the histogram represents quantitative data measured on a continuous scale, there are no gaps between the bars. Consider an example of data on serum cholesterol of 10 subjects (Table - 8 & Fig. - 10).

Table - 8 : Distribution of the subjects

Serum cholesterol (mg/dl)  No of subjects  Percentage
175 - 200                  3               30
200 - 225                  3               30
225 - 250                  2               20
250 - 275                  1               10
275 - 300                  1               10
Total                      10              100
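The stem-and-leaf construction described earlier on this page can be sketched as follows, using the ten ages from the worked example. This is an illustrative sketch only: the function name `stem_and_leaf` is hypothetical, and it assumes non-negative whole numbers with the tens place as the stem.

```python
from statistics import median


def stem_and_leaf(values):
    """Map each tens-place stem to the sorted list of its units-place leaves.

    Stems with no observations are kept (with an empty list) so that the
    vertical axis has no gaps.
    """
    stems = {stem: [] for stem in range(min(values) // 10, max(values) // 10 + 1)}
    for value in sorted(values):
        stems[value // 10].append(value % 10)
    return stems


if __name__ == "__main__":
    ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]
    for stem, leaves in stem_and_leaf(ages).items():
        print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
    print("median:", median(ages))
```

Sorting the values before distributing them keeps the leaves of each stem in ascending order, which is what makes the plot readable as a sideways histogram.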
  8. 8. [Fig. - 10 : Distribution of subjects according to Serum Cholesterol Levels. Histogram; X-axis: Serum Cholesterol Levels (mg/dl), class intervals 175-200 to 275-300; Y-axis: % of subjects.]

Box-and-Whisker plot : A box-and-whisker plot conveys a great deal of information to the audience and can be useful for handling many data values. It allows people to explore data and to draw informal conclusions when two or more variables are present. It shows only certain statistics rather than all the data. Five-number summary is another name for the visual representation of the box-and-whisker plot. The five-number summary consists of the median, the quartiles (lower quartile and upper quartile), and the smallest and greatest values in the distribution. Thus a box-and-whisker plot displays the centre, the spread, and the overall range of the distribution (Fig. - 11).

[Fig. - 11 : Box-and-whisker plot, marking from top to bottom the Largest Value, Upper Quartile (Q3), Median Quartile (Q2), Lower Quartile (Q1) and Smallest Value.]

Line chart : A line chart is used for quantitative data. It is an excellent method of displaying the changes that occur in disease frequency over time. It thus helps in assessing “temporal trends” and in displaying data on epidemics or localised outbreaks in the form of an epidemic curve. In a line diagram, the rates of disease are plotted along the vertical (Y) axis. However, in localised outbreaks with a well demarcated population at risk (such as sudden outbreaks of food poisoning), the actual numbers can be plotted on the Y-axis during quick investigations. The unit of time, as applicable to the disease in question, is plotted along the X-axis (horizontal). This unit of time would be hours for food poisoning, days (i.e. as per dates of the month) for cholera, weeks for typhoid, malaria or Hepatitis-A, months for Hepatitis-B, and years (or even decades) for IHD or lung cancer.

Scatter Diagram : A scatter diagram gives a quick visual display of the association between two variables, both of which are measured on a numerical continuous or numerical discrete scale. An example of a scatter plot between age (in months) and body weight (in kg) of infants is given in Fig. - 12.

[Fig. - 12 : Scatter Diagram of the association between Age and Body Weight of infants. X-axis: Age in months; Y-axis: Body Weight (Kgs.).]

The scatter diagram in the above figure shows at once that weight and age are associated: as age increases, weight increases. Be careful to record the dependent variable along the vertical (Y) axis and the independent variable along the horizontal (X) axis. In this example weight is dependent on age (as age increases, weight is likely to increase), but age is not dependent on weight (if weight increases, age will not necessarily increase). Thus weight is the dependent variable and has been plotted on the Y-axis, while age is the independent variable, plotted along the X-axis.

Summary

Raw information collected by the researcher, which is just a jumble of numbers, needs to be presented and displayed in a manner that makes sense and can be further processed. Data presented in an eye-catching way can highlight particular figures and situations, draw attention to specific information, bring out hidden patterns and important information, and simplify complex information. Raw information can be presented either in tables, i.e. tabular presentation, or in graphs and charts, i.e. graphical presentation. A table consists of rows and columns. The data are condensed in homogeneous groups called class intervals, and the number of individuals falling in each class interval, called the frequency, is displayed. A table is incomplete without a title; a clear title describing the data completely and concisely should be written. Graphical presentation is used when data need to be displayed in charts and graphs. A chart or diagram should have
  9. 9. a clear title describing the data depicted. The X-axis and the Y-axis should be properly defined along with the scale. Legend in case of more than one variable or group is necessary. An optional footnote giving the source of information may be present. Appropriate graphical presentation should be depicted depending on whether data is quantitative or qualitative. While dealing with quantitative data histograms, line chart, polygon, stem and leaf and box and whisker plots should be used whereas bar charts, pictograms and pie charts should be used when dealing with qualitative data.

Study Exercises

Long Question : Discuss the art of effective presentation in the field of health, in respect of data and information; so as to convince the makers of decision.

Short Notes : (1) Discuss the need for graphical presentation of data (2) Differentiate between inclusive and exclusive type of class intervals (3) Box and Whisker Plot (4) Scatter diagram

MCQs

1. Which of the following is used for representing qualitative data (a) Histogram (b) Polygon (c) Pie chart (d) Line chart
2. The scatter plot is used to display (a) Causality (b) Correlation (c) Power (d) Type II error
3. Five summary plot consists of Quartiles and (a) Median (b) Mode (c) Mean (d) Range
4. The appropriate method of displaying the changes that occur in disease frequency over time (a) Line chart (b) Bar chart (c) Histogram (d) Stem and leaf.
5. Box and whisker plot is also known as (a) Magical box (b) Four summary plot (c) Five summary plot (d) None of the above
6. The type of diagram useful to detect linear relationship between two variables is (a) Histogram (b) Line Chart (c) Scatter Plot (d) Bar Chart
7. The following table shows the age distribution of cases of a certain disease reported during a year in a particular state. Which graphical presentation is appropriate to describe this data? (a) Pie chart (b) Line chart (c) Histogram (d) Pictogram

Age    Number of cases
5-14   5
15-24  10
25-34  120
35-44  22
45-54  13
55-64  5

8. Which graphical presentation is best to describe the following data? (a) Multiple bar chart (b) Pie chart (c) Histogram (d) Box plot

Year     Exports (crores of rupees)  Imports (crores of rupees)
1960-61  610.3                       624.65
1961-62  955.39                      742.78
1962-63  660.65                      578.36
1963-64  585.25                      527.98

9. Of the 140 children, 20 lived in owner occupied houses, 70 lived in council houses and 50 lived in private rented accommodation. Type of accommodation is a categorical variable. Appropriate graphical presentation will be (a) Line chart (b) Simple Bar chart (c) Histogram (d) Frequency Polygon
10. A study was conducted to assess the awareness of phimosis in young infants and children up to 5 years of age. The awareness level with respect to the family income is as tabulated below. Which graphical presentation is best to describe the following data? (a) Stem & Leaf (b) Pie Chart (c) Multiple Bar Chart (d) Component Bar Chart

Family income  <2000  2000 - 5000  5000 - 8000  >8000
Aware          50     62           77           70
Unaware        50     28           23           30

11. Following is the frequency distribution of the serum levels of total cholesterol reported in a sample of 71 subjects. Which graphical presentation is best to describe the following data? (a) Stem & Leaf (b) Pie (c) Histogram (d) Component Bar Chart

Serum cholesterol level  Frequency
<130                     2
130-150                  7
150-170                  18
170-190                  20
190-210                  15
210-230                  7
>230                     2

12. Information from the Sports Committee member on representation in different games at the state level by gender is as given below. Which graphical presentation is best to describe the following data

Different Games  Females  Males
Long Jump        4        6
High Jump        2        4
Shot Put         9        11
Running          15       10
Swimming         5        4
  10. 10. (a) Box plot (b) Histogram (c) Multiple Bar Chart (d) Pie chart
13. Which graphical presentation is best to describe the following data
Grade of malnutrition | Frequency
Normal | 60
Grade I | 30
Grade II | 7
Grade III | 2
Grade IV | 1
(a) Box Plot (b) Component Bar Chart (c) Histogram (d) Pie chart
Answers : (1) c; (2) b; (3) a; (4) a; (5) c; (6) c; (7) c; (8) a; (9) b; (10) d; (11) c; (12) c; (13) d.

Statistical Exercise
1. Following is the population data in a locality. Present the data in tabular form as well as using appropriate graphs.
S. No. | Age | S. No. | Age | S. No. | Age
1 | 11 | 11 | 8 | 21 | 16
2 | 15 | 12 | 12 | 22 | 17
3 | 6 | 13 | 22 | 23 | 19
4 | 17 | 14 | 24 | 24 | 8
5 | 18 | 15 | 16 | 25 | 9
6 | 7 | 16 | 19 | 26 | 10
7 | 25 | 17 | 20 | 27 | 24
8 | 32 | 18 | 9 | 28 | 31
9 | 12 | 19 | 21 | 29 | 32
10 | 34 | 20 | 31 | 30 | 37

40 Summarising the Data: Measures of Central Tendency and Variability
Seema R. Patrikar

The huge raw information gathered by the researcher is organized and condensed in a table or graphical display. Compiling and presenting the data in tabular or graphical form will not give complete information of the data collected. We need to “summarise” the entire data in one figure, looking at which we can get an overall idea of the data. Thus, the data set should be meaningfully described using summary measures. Summary measures provide a description of data in terms of concentration of data and variability existing in data. Having described our data set we use these summary figures to draw certain conclusions about the reference population from which the sample data has been drawn. Thus data is described by two summary measures namely, measure of central tendency and measure of variability. Before we discuss the various measures in detail, we should understand the distribution of the data set.

Measures of Central Tendency
This gives the centrality measure of the data set i.e. where the observations are concentrated. There are numerous measures of central tendency. These are : Mean; Median; Mode; Geometric Mean; Harmonic Mean.

Mean (Arithmetic Mean) or Average
This is the most appropriate measure for data following normal distribution but not for skewed distributions. It is calculated by summing all the observations and then dividing by the number of observations. It is generally denoted by $\bar{x}$. It is calculated as follows.

Mean ($\bar{x}$) = (Sum of the values of all observations) / (Total number of observations, that is, the total number of subjects, denoted by "n")

Mathematically,
$$\bar{x} = \frac{\sum_i x_i}{n}$$

It is the simplest of the centrality measures but is influenced by extreme values and hence at times may give fallacious results. It depends on all values of the data set but is affected by the fluctuations of sampling.

Example : The serum cholesterol level (mg/dl) of 10 subjects were found to be as follows: 192 242 203 212 175 284 256 218 182 228
We observe that the above data set is of quantitative type. To calculate the mean, the first step is to sum all the values. Thus,
$$\sum_i x_i = 192 + 242 + 203 + \dots + 228 = 2192$$
The second step is to divide this sum by the total number of observations (n), which is 10 in our example. Thus,
$$\bar{x} = \frac{\sum_i x_i}{n} = \frac{2192}{10} = 219.2$$
Thus the average value of serum cholesterol among the 10 subjects studied = 219.2 mg/dl. This summary value of mean describes our entire data in one value.
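The two steps of the mean calculation above can be checked with a few lines of Python. This is only an illustrative sketch: the names `cholesterol` and `arithmetic_mean` are assumptions introduced here and do not come from the textbook.

```python
# Serum cholesterol (mg/dl) of the 10 subjects from the example above.
cholesterol = [192, 242, 203, 212, 175, 284, 256, 218, 182, 228]


def arithmetic_mean(values):
    """Sum of all observations divided by the number of observations."""
    return sum(values) / len(values)


# Step 1: sum all the values. Step 2: divide by n.
total = sum(cholesterol)
mean_value = arithmetic_mean(cholesterol)

print("Sum of observations:", total)        # 2192
print("Number of observations:", len(cholesterol))
print("Mean (mg/dl):", mean_value)          # 219.2
```

Replacing one value by a very large one (say 284 by 984) and re-running the sketch shows how strongly a single extreme observation pulls the mean, which is the weakness noted in the text.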
  11. 11. Calculation of mean from grouped data : For calculating the mean from “grouped data” we should first find out the midpoint (class mark) of each class interval, which we denote by x. (The midpoint is calculated by adding the upper limit and the lower limit of the respective class interval and dividing by 2.) The next step is to multiply each midpoint by the frequency of that class interval. Summing all these products and then dividing by the total sample size yields the mean value for grouped data.
Consider the following example on 10 subjects on serum cholesterol level (mg/dl), put in class intervals (Table - 1).

Table - 1
Serum cholesterol level (mg/dl) (a) | Midpoint (x) (b) | No. of subjects (f) (c) | x*f (b×c)
175-199 | 187 | 3 | 561
200-224 | 212 | 3 | 636
225-249 | 237 | 2 | 474
250-274 | 262 | 1 | 262
275-299 | 287 | 1 | 287
Total | | 10 = ∑f | 2220 = ∑fx

The mean, then, is calculated as
$$\bar{x} = \frac{\sum f x}{\sum f} = \frac{2220}{10} = 222 \text{ mg/dl}$$

Median
When the data is skewed, another measure of central tendency called the median is used. The median is a locative measure: it is the middlemost observation after all the values are arranged in ascending or descending order. In other words, the median is that value which divides the entire ordered data set into 2 equal parts. When there is an odd number of observations there is a single middle value, which is the median. When there is an even number of observations there are two middle values, and the median is calculated by taking the mean of these two middle observations. Thus,
$$\text{Median} = \begin{cases} \left(\dfrac{n+1}{2}\right)^{\text{th}} \text{ observation}, & \text{when } n \text{ is odd} \\[2ex] \text{mean of the } \left(\dfrac{n}{2}\right)^{\text{th}} \text{ and } \left(\dfrac{n}{2}+1\right)^{\text{th}} \text{ observations}, & \text{when } n \text{ is even} \end{cases}$$
Let us work on our example of serum cholesterol considered in the calculation of mean for ungrouped data. In the first step, we order the data set in ascending order as follows :
175, 182, 192, 203, 212, 218, 228, 242, 256, 284
Since n is 10 (even) we have two middle observations, 212 and 218 (i.e. the 5th and 6th values). Therefore,
$$\text{median} = \frac{212 + 218}{2} = 215$$
Like the mean, the median is also very easy to calculate. In fact, if the observations are few, the median can be found just by inspection. Unlike the mean, the median can be calculated even if the extreme observation is missing. It is less affected by fluctuations of sampling than the mean.

Mode
Mode is the most common value that repeats itself in the data set. Though the mode is easy to calculate, at times it may be impossible to calculate if no value repeats itself in the data set. At the other end, we may come across two or more values repeating themselves the same number of times. In such cases the distribution is said to be bimodal or multimodal.

Geometric Mean
Geometric mean is defined as the nth root of the product of the observations. Mathematically,
$$\text{Geometric Mean} = \sqrt[n]{x_1 \times x_2 \times x_3 \times \dots \times x_n}$$
Thus if there are 3 observations in the data set, the first step would be to calculate the product of all three observations. The second step would be to take the cube root of this product. Similarly the geometric mean of 4 values would be the 4th root of the product of the four observations. The merits of the geometric mean are that it is based on all the observations and is not much affected by the fluctuations of sampling. The disadvantage is that it is not easy to calculate and finds limited use in medical research.

Harmonic Mean
Harmonic mean of a set of values is the reciprocal of the arithmetic mean of the reciprocals of the values. Mathematically,
$$\text{Harmonic mean} = \frac{n}{\dfrac{1}{x_1} + \dfrac{1}{x_2} + \dots + \dfrac{1}{x_n}}$$
Thus if there are four values in the data set, 2, 4, 6 and 8, the harmonic mean is
$$\frac{4}{\dfrac{1}{2} + \dfrac{1}{4} + \dfrac{1}{6} + \dfrac{1}{8}} = 3.84$$
Though the harmonic mean is based on all the values, it is not easy to understand and calculate. Like the geometric mean, it also finds limited use in medical research.

Relationship between the Three Measures of Mean, Median and Mode
1. For symmetric curve
Mean = Median = Mode
2. For moderately skewed (asymmetric) curve
Mean – Mode ≈ 3 (Mean – Median)
3. For positively skewed curve
Mean > Median > Mode
4. For negatively skewed curve
Mean < Median < Mode

Choice of Central Tendency
We observe that each measure of central tendency discussed above has some merits and demerits. No one average is good for all types of research.
  12. 12. The choice should depend on the type of information collected and the research question the investigator is trying to answer. If the collected data is quantitative and symmetric or approximately symmetric, the measure generally used is the arithmetic mean. But if one or two observations in the series are very big or very small compared to the other observations, the arithmetic mean gives fallacious conclusions. In such cases (skewed data) the median or mode would give better results. In social and psychological studies, which deal with scored observations or data that are not capable of direct quantitative measurement, such as socio-economic status, intelligence or pain score, the median or mode is a better measure than the mean. However, the ‘mode’ is generally not used since it is not amenable to statistical analysis.

Measures of Relative Position (Quantiles)
Quantiles are the values that divide a set of numerical data arranged in increasing order into an equal number of parts. Quartiles divide the numerical data arranged in increasing order into four equal parts of 25% each. Thus there are 3 quartiles: Q1, Q2 and Q3. Deciles are values which divide the arranged data into ten equal parts of 10% each. Thus we have 9 deciles. Percentiles are the values that divide the arranged data into hundred equal parts of 1% each. Thus there are 99 percentiles. The 50th percentile, 5th decile and 2nd quartile are all equal to the median.

Measures of Variability
Knowledge of central tendency alone is not sufficient for complete understanding of a distribution. For example, if we have three series having the same mean, the mean alone does not throw light on the composition of the data; to supplement it we need a measure which tells us about the spread of the data. In contrast to measures of central tendency, which describe the center of the data set, measures of variability describe the variability or spread of the observations from the center of the data. Various measures of dispersion are as follows.
●● Range
●● Interquartile range
●● Mean deviation
●● Standard deviation
●● Coefficient of variation

Range
One of the simplest measures of variability is the range. Range is the difference between the two extremes, i.e. the difference between the maximum and minimum observation.
Range = maximum observation - minimum observation
One of the drawbacks of the range is that it uses only the extreme observations and ignores the rest. This variability measure is easy to calculate but it is affected by the fluctuations of sampling. It gives a rough idea of the dispersion of the data.

Interquartile Range
Just as the range is the difference between the extreme observations, the interquartile range is calculated by taking the difference between the values of the two extreme quartiles. Thus
Interquartile range = Q3 - Q1
Quartiles divide the total number of observations into 4 equal parts of 25% each. Thus there are three quartiles (Q1, Q2 and Q3) which divide the total observations into four equal parts. The second quartile Q2 is equivalent to the middle value, i.e. the median. The interquartile range gives the middle 50% of the values of the data set. Though the interquartile range is easy to calculate, it suffers from the same defects as the range.

Mean Deviation
Mean deviation is the mean of the differences from a constant ‘A’, which can be taken as the mean, median, mode or any constant observation from the data. The formula for mean deviation is given as follows:
$$\text{Mean deviation} = \frac{\sum_i |x_i - A|}{n}$$
where A may be the mean, median, mode or a constant; $x_i$ is the value of an individual observation; n is the total number of observations; and ∑ is a sign indicating “sum of”. The main drawback of this measure is that it ignores the algebraic signs, and to overcome this drawback we have another measure of variability called the Variance.

Standard Deviation
Variance is the average of the squared deviations of each individual value from the mean ($\bar{x}$). It is mathematically given as follows:
$$\text{Variance} = \frac{\sum_i (x_i - \bar{x})^2}{n}$$
Most often we use the square root of the variance, called the Standard Deviation, to describe the data, as it is free from the drawbacks of the measures described above. Variance squares the units; the standard deviation, by taking the square root, brings the measure back to the same units as the original data and hence is the best measure of variability. It is given as follows:
$$\text{Standard Deviation (SD)} = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n}}$$
The larger the standard deviation, the larger is the spread of the distribution.
Note: When n is less than 30, the denominator in the variance and standard deviation formula changes to (n-1).
Let us demonstrate its calculation using our hypothetical data set on serum cholesterol (Table - 2).
$$\text{Standard Deviation (SD)} = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n - 1}} \quad \text{; since } n < 30$$
$$= \sqrt{\frac{739.84 + 519.84 + \dots + 77.44}{10 - 1}}$$
$$\text{Thus SD} = \sqrt{\frac{10543.6}{9}} = 34.227$$
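The summary measures worked through in this section can be reproduced on the same serum cholesterol data with Python's standard library. The sketch below is illustrative only: all variable names are assumptions made here, the mean deviation is taken about the mean (A = mean), and the standard deviation uses the (n - 1) denominator because n is less than 30.

```python
import math
import statistics

# Serum cholesterol (mg/dl) of the 10 subjects used throughout the chapter.
cholesterol = [192, 242, 203, 212, 175, 284, 256, 218, 182, 228]
n = len(cholesterol)
mean_value = sum(cholesterol) / n                       # 219.2

# Mean from grouped data (Table - 1): sum of midpoint * frequency / total frequency.
midpoints = [187, 212, 237, 262, 287]
frequencies = [3, 3, 2, 1, 1]
grouped_mean = sum(x * f for x, f in zip(midpoints, frequencies)) / sum(frequencies)  # 222.0

# Median: n is even, so it is the mean of the 5th and 6th ordered values.
median_value = statistics.median(cholesterol)           # 215.0

# Geometric mean: nth root of the product of the observations.
geometric_mean = math.prod(cholesterol) ** (1 / n)

# Harmonic mean of the values 2, 4, 6 and 8 from the text (about 3.84).
harmonic_mean = statistics.harmonic_mean([2, 4, 6, 8])

# Range: maximum observation minus minimum observation.
data_range = max(cholesterol) - min(cholesterol)        # 109

# Mean deviation about the mean, ignoring algebraic signs.
mean_deviation = sum(abs(x - mean_value) for x in cholesterol) / n

# Standard deviation with the (n - 1) denominator, since n < 30.
sum_of_squares = sum((x - mean_value) ** 2 for x in cholesterol)
sd = math.sqrt(sum_of_squares / (n - 1))

print(f"Grouped mean    : {grouped_mean}")
print(f"Median          : {median_value}")
print(f"Geometric mean  : {geometric_mean:.2f}")
print(f"Harmonic mean   : {harmonic_mean:.2f}")
print(f"Range           : {data_range}")
print(f"Mean deviation  : {mean_deviation:.2f}")
print(f"SD (n - 1)      : {sd:.3f}")
```

The hand-computed `sd` agrees with `statistics.stdev`, which also divides by n - 1; `statistics.pstdev` would give the version that divides by n.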
  13. 13. Table - 2
Sr. No | Serum cholesterol (x) | (x − x̄), with x̄ = 219.2 | (x − x̄)²
1 | 192 | 192 − 219.2 = −27.2 | (−27.2)² = 739.84
2 | 242 | 242 − 219.2 = 22.8 | (22.8)² = 519.84
3 | 203 | −16.2 | 262.44
4 | 212 | −7.2 | 51.84
5 | 175 | −44.2 | 1953.64
6 | 284 | 64.8 | 4199.04
7 | 256 | 36.8 | 1354.24
8 | 218 | −1.2 | 1.44
9 | 182 | −37.2 | 1383.84
10 | 228 | 8.8 | 77.44
Total | 2192 | | 10543.6

Calculation of Standard deviation in a grouped data : For grouped data the calculation of standard deviation changes slightly. It is given by the following formula.
$$\text{SD} = \sqrt{\frac{\sum_i f_i (x_i - \bar{x})^2}{n}} \quad \text{; replace } n \text{ by } n-1 \text{ if observations are less than 30}$$
where $f_i$ is the frequency (i.e. number of subjects in that group), $\sum f_i = n$, and $\bar{x}$ is the overall mean. Suppose the data on serum cholesterol was grouped, as we had demonstrated earlier in this chapter for calculation of the mean for grouped data. We had calculated the mean as 222. Now in the same table, make more columns as in Table - 3.

Table - 3
Serum cholesterol level (mg/dl) | Midpoint (x) | No. of subjects (f) | (x − x̄) | (x − x̄)² | f × (x − x̄)²
175-199 | 187 | 3 | (187 − 222) = −35 | (−35)² = 1225 | 3 × 1225 = 3675
200-224 | 212 | 3 | (212 − 222) = −10 | 100 | 300
225-249 | 237 | 2 | 15 | 225 | 450
250-274 | 262 | 1 | 40 | 1600 | 1600
275-299 | 287 | 1 | 65 | 4225 | 4225
Total | | 10 = ∑f | | 7375 | 10250

Thus,
$$\text{SD} = \sqrt{\frac{\sum_i f_i (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{10250}{10 - 1}} = \sqrt{\frac{10250}{9}} \approx 33.75$$

Coefficient of Variation
Besides the measures of variability discussed above, we have one more important measure called the coefficient of variation, which compares the variability in two data sets. It measures the variability relative to the mean and is calculated as follows:
$$\text{Coefficient of Variation (CV)} = \frac{\text{Standard deviation}}{\text{Mean}} \times 100$$
If the coefficient of variation is greater for one data set, it suggests that this data set is more variable than the other data set.
Thus, any information that is collected by the researcher needs to be described by measures of central tendency and measures of variability. Both the measures together describe the data. Measures of central tendency alone will not give a complete idea about the data set without a measure of variability. Descriptive Statistics is critical because it often suggests possible hypotheses for future investigation.

Summary
Raw information is organized and condensed by using tabular and graphical presentations, but compiling and presenting the data in tabular or graphical form will not give complete information of the data collected. We need to “summarise” the entire data in one figure, looking at which we can get an overall idea of the data. Thus, the data set should be meaningfully described using summary measures. Summary measures provide a description of data in terms of concentration of data and variability existing in data. Having described our data set we use these summary figures to draw certain conclusions about the reference population from which the sample data has been drawn. Thus data is described by two summary measures namely, measures of central tendency and measures of variability. Measures of central tendency describe the centrality of the data set. In other words, central tendency tells us where the data is concentrated. If the researcher is dealing with quantitative data, the mean is the best centrality measure, whereas for qualitative data the median and mode describe the data appropriately. Measures of variability give the spread or dispersion of the data. In other words, they describe the scatter of the individual observations from the central value. The simplest of the variability measures is the range, which is the difference between the two extreme observations. Other measures of dispersion are mean deviation, variance and standard deviation. Standard deviation is the most commonly used variability measure to describe quantitative data, as it is free from the drawbacks of the simpler measures. When commenting on the variability while dealing with two or more groups or techniques, a special measure of variability called the coefficient of variation is used. The group in which the coefficient of variation is more is said to be more variable than the other.
  14. 14. Both measures of central tendency and measures of variability together describe the data set and often suggest possible hypotheses for future investigation.

Study Exercises
Short Notes : (1) Measures of central tendency (2) Measures of Variation
MCQs
1. Which of the Statistical average takes into account all the numbers equally? (a) Mean (b) Median (c) Mode (d) None of the above
2. Which of the following is a measure of Spread (a) Variance (b) Mean (c) p value (d) Mode
3. Which of the following is a measure of location (a) Variance (b) Mode (c) p value (d) Median
4. Which among the following is not a measure of variability: (a) Standard deviation (b) Range (c) Median (d) Coefficient of Variation
5. For a positively skewed curve which measure of central tendency is largest (a) Mean (b) Mode (c) Median (d) All are equal
6. Most common value that repeats itself in the data set is (a) Mean (b) Mode (c) Median (d) All of the above.
7. Variance is square of (a) p value (b) Mean deviation (c) Standard deviation (d) Coefficient of variation.
8. Percentiles divides the data into _____ equal parts (a) 100 (b) 50 (c) 10 (d) 25
9. 10 babies are born in a hospital on same day. All weigh 2.8 Kg each; What would be the standard deviation (a) 0.28 (b) 1 (c) 2.8 (d) 0
10. To compare the variability in two populations we use this measure (a) Range (b) Coefficient of Variation (c) Median (d) Standard deviation
Answers : (1) a; (2) a; (3) d; (4) c; (5) a; (6) b; (7) c; (8) a; (9) d; (10) b.

Statistical Exercises
1. A researcher who wanted to know the weights in Kg of children of second standard collected the following information on 15 students: 10, 20, 11, 12, 12, 13, 11, 14, 13, 13, 15, 11, 16, 17, 18. What type of data is it? Calculate mean, median and mode from the above data. Calculate mean deviation and standard deviation. (Answer : Mean = 13.7, Median = 13, Mode = 11 & 13, Mean deviation = 2.34, Standard deviation = 2.9)
2. If the height (cm) of the same students is 95, 110, 98, 100, 102, 102, 99, 103, 104, 103, 106, 99, 108, 108, 109. What type of data is it? What is the scale of measurement? Calculate mean, median and mode from the above data. Calculate mean deviation and standard deviation. Between height and weight which is more variable and why? (Answer : Mean = 103.1, Median = 103, Mode = 99, 102, 103 & 108, Mean deviation = 3.55, Standard deviation = 4.4, Coefficient of variation of weight = 21.17, Coefficient of variation of height = 4.27 hence weight is more variable)

41 Introducing Inferential Statistics : Gaussian Distribution and Central Limit Theorem
Seema R. Patrikar

The Gaussian Distribution or Normal Curve
If we draw a smooth curve passing through the mid points of the bars of a histogram and the curve is bell shaped, then the data is said to be roughly following a normal distribution. Many different types of data distributions are encountered in medicine. The Gaussian or “normal” distribution is among the most important. Its importance stems from the fact that the characteristics of this theoretical distribution underlie many aspects of both descriptive and inferential statistics (Fig. - 1).
  15. 15. Gaussian distribution is one of the important distributions in statistics. Most of the data relating to social and physical sciences conform to this distribution for sufficiently large numbers of observations by virtue of the central limit theorem.
Normal distribution was first discovered by the mathematician De-Moivre. Karl Gauss and Pierre-Simon Laplace used this distribution to describe error of measurement. Normal distribution is also called ‘Gaussian distribution’.
A normal curve is determined entirely by the mean and the standard deviation. Hence it is possible to have various normal curves with different standard deviations but the same mean (Fig. - 2a) and various normal curves with different means but the same standard deviation (Fig. - 2b).

[Fig. - 2a: Normal curves with same mean but different standard deviations]
[Fig. 2b : Normal curves with same standard deviation but different means]

The normal curve possesses many important properties and is of extreme importance in the theory of errors. The normal distribution is defined by the following characteristics:
●● It is a bell shaped symmetric (about the mean) curve.
●● The curve on either side of the mean is a mirror image of the other side.
●● The mean, median and mode coincide.
●● Highest frequency (frequency means the number of observations for a particular value or in a particular class interval) is in the middle around the mean and lowest at both the extremes, and frequency decreases smoothly on either side of the mean.
●● The total area under the curve is equal to 1 or 100%.
●● The most important relationship in the normal curve is the area relationship.
The proportional area enclosed between the mean and multiples of SD is constant.
Mean ± 1 SD -------> 68% of the total area
Mean ± 2 SD -------> 95% of the total area
Mean ± 3 SD -------> 99.7% of the total area
Fig. 3 shows the area enclosed by 1, 2 and 3 SD from the mean.

[Fig. - 3: normal curve with the areas 68 %, 95 % and 99.7 % marked between Mean-1SD and Mean+1SD, Mean-2SD and Mean+2SD, and Mean-3SD and Mean+3SD respectively]

If these criteria are not met, then the distribution is not a Gaussian or normal distribution.

Standard Normal Variate (SNV)
As already specified, a normal frequency curve can be described completely with the mean and standard deviation values. Even the same set of data would provide different values for the mean and SD, depending on the choice of units of measurement. For example, the same person's height can be expressed as 66 inches or 167.6 cms. An infant’s birth weight can be recorded as 2500 gms or 5.5 pounds. Because the units of measurement differ, so do the numbers, although the true height and weight are the same. To eliminate the effect produced by the choice of units of measurement, the data can be put in unit free form, that is, the data can be normalized. The first step in transforming the original variable to a normalized variable is to calculate the mean and SD. The normalized values are then calculated by subtracting the mean from the individual values and dividing by the SD. These normalized values are also called the z values.
$$z = \frac{x - \mu}{\sigma}$$
(where x is the individual observation, µ = mean and σ = standard deviation)
The z values always have a mean of 0 and a standard deviation of 1, and when x follows a normal distribution, z follows the normal distribution with mean 0 and standard deviation 1. The z values are often called the ‘Standard Normal Variate’.

Central Limit Theorem (CLT)
The CLT is responsible for the following remarkable result: The distribution of an average tends to be Normal, even when the distribution from which the average is computed is non-Normal.
Furthermore, this normal distribution will have the same mean as the parent distribution, AND, variance equal to the variance of the parent distribution divided by the sample size (σ²/n).
The central limit theorem states that given a distribution with a mean μ and variance σ², the sampling distribution of the mean approaches a normal distribution with a mean (μ) and a variance σ²/N as N, the sample size, increases. The amazing and counter-intuitive thing about the central limit theorem is that no matter what the shape of the original distribution, the sampling distribution of the mean approaches a normal distribution. Furthermore, for most distributions, a normal distribution is approached very quickly as N increases. Thus, the Central Limit theorem is the foundation for many statistical procedures.
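The area relationship, the z transformation and the central limit theorem described above can be explored numerically. The following Python sketch is illustrative only and rests on several assumptions that are not in the textbook: the heights are hypothetical values, an exponential distribution is chosen arbitrarily as a skewed (non-Normal) parent distribution, and the sample size of 30, the 5000 repetitions and the random seed are arbitrary choices.

```python
import math
import random
import statistics


def normal_area_within(k):
    """Area under a normal curve between mean - k*SD and mean + k*SD."""
    return math.erf(k / math.sqrt(2))


# Area relationship: roughly 0.683, 0.954 and 0.997 for k = 1, 2, 3.
areas = {k: normal_area_within(k) for k in (1, 2, 3)}


def z_scores(values):
    """Subtract the mean from each value and divide by the SD."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]


# Hypothetical heights, first in inches and then converted to centimetres.
heights_in = [60, 62, 64, 66, 68, 70, 72]
heights_cm = [h * 2.54 for h in heights_in]
z_in = z_scores(heights_in)
z_cm = z_scores(heights_cm)   # same z values: the units cancel out

# Central limit theorem: means of samples drawn from a skewed parent
# distribution (exponential with mean 1 and variance 1).
rng = random.Random(0)
sample_size = 30


def one_sample_mean(size):
    return statistics.fmean([rng.expovariate(1.0) for _ in range(size)])


sample_means = [one_sample_mean(sample_size) for _ in range(5000)]
mean_of_means = statistics.fmean(sample_means)          # close to 1
variance_of_means = statistics.pvariance(sample_means)  # close to 1 / 30

for k, area in areas.items():
    print(f"Mean ± {k} SD encloses {area:.1%} of the area")
print("z (inches):", [round(z, 3) for z in z_in])
print("z (cm)    :", [round(z, 3) for z in z_cm])
print(f"Mean of sample means     : {mean_of_means:.3f}")
print(f"Variance of sample means : {variance_of_means:.4f} (sigma^2/N = {1 / sample_size:.4f})")
```

Plotting a histogram of `sample_means` (not done here, to keep the sketch dependency-free) would show a roughly bell shaped curve even though the parent exponential distribution is strongly skewed.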