data organization and presentation.pptx

Apr. 17, 2023
How to organize and present data


  1. Chapter two: DATA ORGANIZATION AND PRESENTATION. Mengistu Y. (BSC, MPH-HI), 2017
  2. Learning objectives
     At the end of this section students are expected to:
     • understand the nature of data
     • organize and present data according to the needs of the activity
     • present data in tables and graphs for information use
  3. Data organization and presentation
     • Statistics is used to organize and interpret research observations and findings.
     • Before the findings are interpreted and communicated, the raw data must be organized and presented in a clear and understandable way.
     • Techniques used to organize and summarize a set of data concisely:
       – Organization of data
       – Summarization of data
       – Presentation of data
  4. Cont...
     • Numbers that have not been summarized and organized are called raw data.
     • Descriptive statistics include tables, graphical or chart displays, and the calculation of summary measures such as means and proportions.
     • The methods of describing variables differ depending on the type of data (numerical or categorical).
  5. Organizing data
     Categorical data
     • Table of frequency distributions
       – Frequency
       – Relative frequency
       – Cumulative frequencies
     • Graphs
       – Bar charts
       – Pie charts
     Continuous or discrete data
     • Frequency distribution
     • Summary measures
     • Graphs
       – Histograms
       – Frequency polygons
       – Cumulative frequency polygons
       – Stem-and-leaf plots
       – Box-and-whisker plots
       – Scatter plots
  6. Frequency distributions
     • A frequency distribution is a presentation of the number of times (the frequency) that each value (or group of values) occurs in the study population.
     • Ordered array: a simple arrangement of individual observations in order of magnitude.
     • A simple and effective way of summarizing categorical data is to construct a frequency distribution table.
     • This is done by counting the number of observations falling into each of the categories, or levels, of the variable.
     • Consider, for example, the variable birth weight with levels 'Very low', 'Low', 'Normal' and 'Big'.
  7. Relative frequency
     • Sometimes it is useful to compute the proportion, or percentage, of observations in each category.
     • The distribution of proportions is called the relative frequency distribution of the variable.
     • Given the total number of observations, the relative frequency distribution is easily derived from the frequency distribution: divide each frequency by the total.
  8. Cumulative frequency
     • Two other distributions, the cumulative frequency and the cumulative relative frequency, are particularly useful for describing ordinal data.
     • They are meaningless for nominal data: one would never say, for example, that 70% of observations fall below the colour blue.
     • The cumulative frequency is the number of observations in the category plus the observations in all categories smaller than it.
     • The cumulative relative frequency is the proportion of observations in the category plus the observations in all categories smaller than it. It is obtained by dividing the cumulative frequency by the total number of observations.
  9. Table 2. Distribution of birth weight of newborns between 1976-1996 at TAH.
     BWT        Freq.   Rel. freq. (%)   Cum. freq.   Cum. rel. freq. (%)
     Very low      43         0.4             43              0.4
     Low          793         8.0            836              8.4
     Normal      8870        88.9           9706             97.3
     Big          268         2.7           9974            100
     Total       9974       100
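The columns of Table 2 can be reproduced from the raw counts alone. The following is a minimal Python sketch, not part of the original slides; the variable names (`counts`, `rows`) are illustrative, and the counts are copied from the table.

```python
from itertools import accumulate

# Counts copied from Table 2, listed in the natural (ordinal) order of the categories.
counts = {"Very low": 43, "Low": 793, "Normal": 8870, "Big": 268}

total = sum(counts.values())
# Running totals give the cumulative frequency of each category.
cum_counts = list(accumulate(counts.values()))

rows = []
for (category, freq), cum in zip(counts.items(), cum_counts):
    rel = 100 * freq / total       # relative frequency, in percent
    cum_rel = 100 * cum / total    # cumulative relative frequency, in percent
    rows.append((category, freq, round(rel, 1), cum, round(cum_rel, 1)))

for row in rows:
    print("{:<9}{:>6}{:>7}{:>7}{:>7}".format(*row))
```

The cumulative columns are only meaningful because the categories are ordered; for a nominal variable the same arithmetic would run but the result would have no interpretation.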
  10. Frequency distribution for numerical data
      • Beyond the ordered array, further useful summarization may be achieved by grouping the data.
      • To group a set of observations we select a set of continuous, non-overlapping intervals such that each value in the set of observations can be placed in one, and only one, of the intervals.
      • These intervals are usually referred to as class intervals.
  11. • One of the first considerations when data are to be grouped is how many intervals to include.
      • The question is how best to organize such data, particularly when the data set is too large to be managed by eye.
  12. Table 3. Frequencies of serum cholesterol levels for 1067 US males of ages 25-34 (1976-1980).
      Cholesterol level (mg/100 ml)   Freq.   Rel. freq. (%)   Cum. freq.   Cum. rel. freq. (%)
      80-119                            13         1.2             13              1.2
      120-159                          150        14.1            163             15.3
      160-199                          442        41.4            605             56.7
      200-239                          299        28.0            904             84.7
      240-279                          115        10.8           1019             95.5
      280-319                           34         3.2           1053             98.7
      320-359                            9         0.8           1062             99.5
      360-399                            5         0.5           1067            100
      Total                           1067       100
  13. For both discrete and continuous data, the values are grouped into non-overlapping intervals, usually of equal width.
  14. Example of raw data of age
  15. Example of categorized data of age
  16. How to calculate class interval?
      • To determine the number of class intervals and the corresponding width, we use Sturges' rule:
          $K = 1 + 3.322 \log_{10} n$
          $W = (L - S) / K$
        where K = number of class intervals, n = number of observations, W = width of the class interval, L = the largest value, S = the smallest value.
  17. Example
      • Construct a grouped frequency distribution of the following data on the amount of time (in hours) that 80 college students devoted to leisure activities during a typical school week:
  18. Example:
  19. The amount of time (in hours) that 80 college students devoted to leisure activities during a typical school week
      • Using the above formula, $K = 1 + 3.322 \log_{10}(80) = 7.32 \approx 7$ classes.
      • Maximum value = 38 and minimum value = 10.
      • $W = \text{range}/K = (38 - 10)/7 = 28/7 = 4$.
      • Using a width of 5 instead (a common rule of thumb), we can construct a grouped frequency distribution for the above data.
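The arithmetic of this example can be checked with a short Python sketch. It is illustrative only: the function names `sturges_classes` and `class_width` are made up for this sketch, and the logarithm is taken to base 10, which is what the constant 3.322 implies.

```python
import math

def sturges_classes(n):
    """Number of class intervals, K = 1 + 3.322 * log10(n), before rounding."""
    return 1 + 3.322 * math.log10(n)

def class_width(largest, smallest, k):
    """Class width, W = (L - S) / K."""
    return (largest - smallest) / k

# Values taken from the leisure-time example: n = 80, L = 38, S = 10.
n, largest, smallest = 80, 38, 10
k_exact = sturges_classes(n)
k = round(k_exact)                       # 7.32 rounds to 7 classes
w = class_width(largest, smallest, k)    # 28 / 7

print(f"K = {k_exact:.2f} -> {k} classes, W = {w:g}")
```

The rule gives only a starting point; as the slide notes, the computed width of 4 is then replaced by the more convenient width of 5.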
  21. Mid-point and true limits
      • Mid-point (class mark): the value of the interval that lies midway between the lower and the upper limits of a class.
      • True limits (class boundaries): the limits that make the intervals of a continuous variable continuous in both directions.
        – Used for smoothing the class intervals
        – For data recorded in whole units, subtract 0.5 from the lower limit and add 0.5 to the upper limit
  22. Contd…
      • Note: in constructing a cumulative frequency distribution, if we cumulate from the lowest value of the variable to the highest, the result is called a 'less than' cumulative frequency distribution. If we cumulate from the highest value to the lowest, the result is called a 'more than' cumulative frequency distribution.
      • The 'less than' cumulative frequency is the more common of the two.
  23. Example
      Time (hours)   True limits    Mid-point   Frequency
      10-14           9.5 - 14.5       12            8
      15-19          14.5 - 19.5       17           28
      20-24          19.5 - 24.5       22           27
      25-29          24.5 - 29.5       27           12
      30-34          29.5 - 34.5       32            4
      35-39          34.5 - 39.5       37            1
      Total                                         80
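The true limits, mid-points and both kinds of cumulative frequency can be derived from the class limits and frequencies in this table. The sketch below is one way to do it in Python, assuming the class limits are whole numbers as in the example; the variable names are illustrative.

```python
from itertools import accumulate

# Class limits and frequencies copied from the example table.
classes = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39)]
freqs = [8, 28, 27, 12, 4, 1]

# Half the gap between one class's upper limit and the next class's lower limit
# (0.5 here, because the limits are whole numbers).
half_gap = (classes[1][0] - classes[0][1]) / 2

# True limits: lower limit minus the half gap, upper limit plus the half gap.
boundaries = [(lo - half_gap, hi + half_gap) for lo, hi in classes]
# Mid-point: midway between the lower and upper limits of the class.
midpoints = [(lo + hi) / 2 for lo, hi in classes]

# 'Less than' cumulation runs from the lowest class upward;
# 'more than' cumulation runs from the highest class downward.
less_than = list(accumulate(freqs))
more_than = list(accumulate(reversed(freqs)))[::-1]

for c, b, m, f, lt, mt in zip(classes, boundaries, midpoints, freqs, less_than, more_than):
    print(c, b, m, f, lt, mt)
```

The last 'less than' value and the first 'more than' value both equal the total of 80, which is a quick check that no class was skipped.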
  24. • Class interval: the length of the class, given by the difference between its class boundaries. For the first class, 14.5 - 9.5 = 5.
      • Note: as the sample size increases and the interval width is reduced, the sample distribution more closely resembles the population distribution.
  25. – Class intervals should be continuous, non-overlapping, mutually exclusive and exhaustive.
      – Too few intervals result in loss of information.
      – Too many intervals defeat the objective of summarization.
      – Class intervals should generally be of the same width (sometimes this is impossible).
      – Open-ended class intervals should be avoided.
  26. Exercise
      • Construct a grouped frequency distribution and complete the following table for the age of patients (years) in a diabetic clinic in Addis Ababa, 2010.
  27. Age of patients (years) in a diabetic clinic in Addis Ababa, 2010
      Columns of the table to complete:
      • Age group (years)
      • Class limit
      • Class boundary
      • Class mid-point
      • Tally
      • Frequency (fi)
      • Relative frequency (fraction, %)
      • Cumulative frequency (< method and > method)
      • Relative cumulative frequency (< method and > method)
      Final row: Total
  28. METHODS OF DATA PRESENTATION
  29. Data table
      Guidelines for constructing tables
      • Keep them simple
      • Limit the number of variables
      • All tables should be self-explanatory
      • Include a clear title telling what, where and when
      • Clearly label the rows and columns
  30. Cont...
      • State clearly the unit of measurement used
      • Explain codes and abbreviations in a footnote
      • Show totals
      • If the data are not original, indicate the source in a footnote
  31. Graphical presentation of data
      • A variety of graph styles can be used to present data.
      • The most commonly used types of graph are pie charts, bar diagrams, histograms, frequency polygons and scatter diagrams.
      • The purpose of using a graph is to tell others about a set of data quickly, allowing them to grasp its important characteristics.
      • In other words, graphs are visual aids to rapid understanding.
  32. Importance of graphs
      • Diagrams are more attractive than mere figures.
      • They please the eye, add a spark of interest and so catch the attention.
      • They help in deriving the required information in less time and without mental strain.
      • They are more memorable than mere figures.
      • They facilitate comparison.
  33. Bar charts
      • Bar chart: displays the frequency distribution for nominal or ordinal data.
      • In a bar chart the various categories into which the observations fall are represented along the horizontal axis.
      • A vertical bar is drawn above each category such that the height of the bar represents either the frequency or the relative frequency of observations within the class.
      • The vertical axis should always start from 0, but the horizontal axis can start from anywhere.
      • The bars should be of equal width and should be separated from one another so as not to imply continuity.
  34. Figure 1. Bar charts showing the frequency distribution of the variable 'BWT'. Two panels, frequency and relative frequency, each with the categories Very low, Low, Normal and Big on the horizontal axis.
  35. Bar charts for comparison
      • Multiple bar chart: to compare the distribution of a variable for two or more groups, bars for the groups being compared are often drawn alongside each other in a single bar chart.
      • Sub-divided bar chart: if different quantities form the sub-divisions of the totals, simple bars may be sub-divided in the ratio of the various sub-divisions to show the relationship of the parts to the whole.
  36. Fig 2. Bar chart indicating categories of birth weight of 9975 newborns grouped by antenatal follow-up of the mothers. Horizontal axis: BWT (Low, Normal, Big); vertical axis: percent (0 to 100); legend: Yes, No. Data labels: 9, 88.9, 2.1 and 7.9, 89, 3.1.
  37. Example: Plasmodium species distribution for confirmed malaria cases, Zeway, 2003
  38. Pie chart
      • Pie chart: displays the frequency distribution for nominal or ordinal data.
      • In a pie chart the various categories into which the observations fall are represented by sectors of a circle.
      • Each sector represents either the frequency or the relative frequency of observations within the class, and its angle is proportional to that frequency or relative frequency.
  39. Figure 3. Pie charts showing the frequency distribution of the variable 'BWT'.
      Fig 3(a) Pie chart indicating frequency of categories of birth weight: Very low 43, Low 793, Normal 8870, Big 268.
      Fig 3(b) Pie chart indicating relative frequency (%) of categories of birth weight: Very low 0.4, Low 8, Normal 88.9, Big 2.7.
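Because each sector's angle is proportional to its relative frequency, the angles for Fig 3 follow directly from the counts. A minimal Python sketch, using the frequencies from Fig 3(a); the names `counts` and `angles` are illustrative:

```python
# Frequencies copied from Fig 3(a).
counts = {"Very low": 43, "Low": 793, "Normal": 8870, "Big": 268}
total = sum(counts.values())

# Sector angle = relative frequency x 360 degrees.
angles = {category: 360 * freq / total for category, freq in counts.items()}

for category, angle in angles.items():
    print(f"{category:<9}{angle:6.1f} degrees")
```

The 'Very low' sector comes out at under two degrees, which shows why very small categories are hard to read on a pie chart.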
  40. Histogram
      • A histogram is a frequency distribution with continuous class intervals that has been turned into a graph.
      • Given a set of numerical data, we can obtain an impression of the shape of its distribution by constructing a histogram.
      • A histogram is constructed by choosing a set of non-overlapping intervals (class intervals) and counting the number of observations that fall in each class.
  41. Histograms cont…
      • The number of observations in each class is called the frequency. Hence a histogram is a graphical form of a frequency distribution.
      • The class intervals must be non-overlapping so that each observation falls in one and only one interval.
  42. Histograms cont…
      • Except for the two end intervals, class intervals are usually chosen to be of equal width. If they are not, the histogram could give a misleading impression of the shape of the data.
      • In drawing the histogram, smoothing the class intervals is important: we subtract 0.5 from the lower limit and add 0.5 to the upper limit of each interval.
  43. Example: Distribution of the age of women at the time of marriage
      Age group   No. of women
      15-19            11
      20-24            36
      25-29            28
      30-34            13
      35-39             7
      40-44             3
      45-49             2
  44. Figure: histogram of the age of women at the time of marriage. Horizontal axis: age group (class boundaries 14.5-19.5 to 44.5-49.5); vertical axis: number of women (0 to 40).
  45. Fig 5. A histogram displaying the frequency distribution of birth weight of newborns at Tikur Anbessa Hospital. Horizontal axis: birth weight. Std. Dev = 502.34, Mean = 3126, N = 9975.
  46. Frequency polygons
      • Instead of drawing bars for each class interval, sometimes a single point is drawn at the mid-point of each class interval and consecutive points are joined by straight lines.
      • Graphs drawn in this way are called frequency polygons.
      • Frequency polygons are superior to histograms for comparing two or more sets of data.
  47. Fig. 6. Frequency polygon of birth weight of 9975 newborns at Tikur Anbessa Hospital for males and females. Horizontal axis: birth weight; vertical axis: %; series: males, females.
  48. Box and whisker plot
      • It is another way to display information when the objective is to illustrate certain features of the distribution, such as location and skewness.
      • It can be used to display a set of discrete or continuous observations using a single vertical axis; only certain summaries of the data are shown.
  49. Box plot cont...
      • A box is drawn with the top of the box at the third quartile (75%) and the bottom at the first quartile (25%).
      • The location of the mid-point (50%, the median) of the distribution is indicated with a horizontal line in the box.
      • Finally, straight lines, or whiskers, are drawn from the centre of the top of the box to the largest observation and from the centre of the bottom of the box to the smallest observation.
  50. Box plot cont...
      The box plot is then completed:
      • Draw a vertical bar from the upper quartile to the largest non-outlying value in the sample.
      • Draw a vertical bar from the lower quartile to the smallest non-outlying value in the sample.
      • Any values that lie outside the box (whose length is the IQR) but are not outliers are covered by the whiskers on the plot (IQR = P75 – P25).
  51. Box plots are useful for comparing two or more groups of observations.
  52. Drawing a box-and-whisker plot
      Raw data: 35, 29, 44, 72, 34, 64, 41, 50, 54, 104, 39, 58
      Ordered data: 29 34 35 39 41 44 50 54 58 64 72 104
      Median = (44 + 50)/2 = 47 = Q2
      Q1 = 37, Q3 = 61, Min = 29, Max = 104
  53. Box plot example. Axis from 0 to 110, marked with Min = 29, Q1 = 37, Q2 = 47, Q3 = 61, Max = 104.
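The quartiles in this example match the "median of each half" method: Q1 is the median of the lower six values and Q3 the median of the upper six. The Python sketch below reproduces them that way. Two things are assumptions rather than content of the slides: the helper name `five_number_summary`, and the 1.5 × IQR fences used to flag outliers, which are a common convention that the slides do not state.

```python
from statistics import median

def five_number_summary(values):
    """Min, Q1, median, Q3, max, with quartiles taken as medians of the two halves."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # For an odd number of values the middle value belongs to neither half.
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return min(data), median(lower), median(data), median(upper), max(data)

raw = [35, 29, 44, 72, 34, 64, 41, 50, 54, 104, 39, 58]
low, q1, q2, q3, high = five_number_summary(raw)
iqr = q3 - q1

# Assumed convention (not from the slides): values beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in raw if x < lower_fence or x > upper_fence]

print(low, q1, q2, q3, high, iqr, outliers)
```

Other quartile definitions (for example, interpolation-based ones used by many statistical packages) give slightly different Q1 and Q3 for the same data, so the method should be stated when reporting them.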
  54. Scatter plot
      • Most studies in medicine involve measuring more than one characteristic, and graphs displaying the relationship between two characteristics are common in the literature.
      • When both variables are qualitative, we can use a multiple bar graph.
      • When one of the characteristics is qualitative and the other is quantitative, the data can be displayed in box-and-whisker plots.
  55. Scatter plot cont...
      • For two quantitative variables we use bivariate plots (also called scatter plots or scatter diagrams).
      • A scatter plot is used to see whether a relationship exists between the two measures.
      • A scatter diagram is constructed by drawing X- and Y-axes.
      • Each point or dot represents a pair of values measured for a single study subject.
      • Illustration on the slide: positive relation.
  56. Scatter plots and types of correlation: negative correlation. As x increases, y decreases. x = hours of training, y = number of accidents.
  57. Scatter plots and types of correlation: positive correlation. As x increases, y increases. x = Math SAT score, y = GPA.
  58. Scatter plots and types of correlation: no linear correlation. x = height, y = IQ.
  59. Scatter diagram: 1. Direction of relationship (positive, negative)
  60. 2. Form of relationship (linear, curvilinear)
  61. 3. Degree of relationship (strong, weak)
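The direction and degree of a linear relationship can be summarized numerically. The sketch below uses Pearson's correlation coefficient, which the slides do not name; it is offered as one common choice. The function name `pearson_r` is illustrative and the data are invented purely for illustration, loosely echoing the hours-of-training example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, NOT from the slides: hours of training and number of accidents.
hours = [2, 4, 6, 8, 10]
accidents = [50, 42, 30, 21, 10]

r = pearson_r(hours, accidents)
direction = "negative" if r < 0 else "positive"
print(f"r = {r:.3f} ({direction} relationship)")
```

The sign of r gives the direction and its magnitude the degree of a linear relationship; a curvilinear relationship can be strong and still give r close to 0, so the scatter plot should always be inspected as well.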
  62. Line graph
      • Useful for assessing the trend of a particular situation over time, e.g. monitoring the trend of epidemics.
      • The time, in weeks, months or years, is marked along the horizontal axis.
      • Values of the quantity being studied are marked on the vertical axis.
      • Values for each category are connected by a continuous line.
      • Sometimes two or more lines are drawn on the same graph, using the same scale, so that they are comparable.
  63. Figure: number of microscopically confirmed malaria cases by species and month at Zeway malaria control unit, 2003. Horizontal axis: months (Jan to Dec); vertical axis: number of confirmed malaria cases (0 to 2100); series: Positive, P. falciparum, P. vivax.
  64. Line graph cont..
      • The following graph shows the level of zidovudine (AZT) in the blood of HIV/AIDS patients at several times after administration of the drug, for patients with normal fat absorption and with fat malabsorption.
      • A line graph can also be used to depict the relationship between two continuous variables, like a scatter diagram.
  65. Line graph cont…
      Figure: response to administration of zidovudine in two groups of AIDS patients in hospital X, 1999. Horizontal axis: time since administration (min); vertical axis: blood zidovudine concentration; series: fat malabsorption, normal fat absorption.
  66. Choosing graphs
      Type of data or purpose   Appropriate graphs
      Metric/numerical          Histogram (one continuous variable)
                                Frequency polygon (one or more continuous variables)
                                Cumulative frequency polygon (ogive curve)
                                Box and whisker (one continuous and one categorical variable)
                                Stem and leaf (one continuous variable)
                                Scatter (two continuous variables)
      Categorical               Bar (one or more categorical variables; simple or multiple)
                                Pie (one categorical variable)
      Trend                     Line (one continuous and one categorical variable, or two continuous variables)
  67. THANK YOU!

Editor's Notes

  • This is so because the impression left by the diagram is of a lasting nature.

