Basic Statistical Concepts and Methods



Statistics is the science of dealing with numbers.
 It is used for collection, summarization, presentation and analysis of data.
Statistics provides a way of organizing data to get information on a wider and more formal (objective) basis than relying on personal experience (subjective).



  1. 1. Basic Statistical Concepts and Methods <ul><li>Ahmed-Refat AG Refat </li></ul><ul><li>FOM-ZU </li></ul>
  2. 2. Definition of Statistics <ul><li>Statistics is the science of dealing with numbers. </li></ul><ul><li>It is used for collection, summarization, presentation and analysis of data. </li></ul><ul><li>Statistics provides a way of organizing data to get information on a wider and more formal (objective) basis than relying on personal experience (subjective). </li></ul>
  3. 3. Uses of medical statistics <ul><li>Medical statistics are used in </li></ul><ul><li>1 -   Planning , monitoring and evaluating community health care programs. </li></ul><ul><li>2-   Epidemiological research studies. </li></ul><ul><li>3-    Diagnosis of community health problems. </li></ul><ul><li>4-    Comparison of health status and diseases in different countries and in one country over years. </li></ul><ul><li>5-    To form standards for the different biological measurements as weight, height. </li></ul><ul><li>6-   To differentiate between diseased and normal groups. </li></ul>
  4. 4. Types of data <ul><li>Any aspect of an individual that is measured, is called variable. Variables are either </li></ul><ul><li>1-Quantitative or 2-Qualitative . </li></ul><ul><li>1-     Quantitative data : it is numerical data. </li></ul><ul><li>Discrete data : are usually whole numbers , such as number of cases of certain disease, number of hospital beds (no decimal fraction). </li></ul><ul><li>Continuous data: it implies the measurement on a continuous scale e.g. height, weight, age (a decimal fraction can be present ). </li></ul>
  5. 5. 1- Quantitative data <ul><li>Quantitative data: it is numerical data. </li></ul><ul><li>Two types: </li></ul><ul><li>A- Discrete data: usually whole numbers, such as the number of cases of a certain disease or the number of hospital beds (no decimal fraction). </li></ul><ul><li>B- Continuous data: measured on a continuous scale, e.g. height, weight, age (a decimal fraction can be present). </li></ul>
  6. 6. 2- Qualitative data <ul><li>   Qualitative data : It is non numerical data and is subdivided into Two Types: </li></ul><ul><li>  A- Categorical : data are purely descriptive and imply no ordering of any kind such as sex, area of residence. </li></ul><ul><li>  B- Ordinal data : are those which imply some kind of ordering like </li></ul><ul><li>-          Level of education: </li></ul><ul><li>-          Socio-economic status: </li></ul><ul><li>-          Degree of severity of disease: </li></ul>
  7. 7. Presentation Of Data <ul><li>The first step in statistical analysis is to present data in an easy way to be understood. </li></ul><ul><li>The two basic ways for data presentation are: </li></ul><ul><li>Tabular presentation. </li></ul><ul><li>Graphical presentation </li></ul>
  8. 8. Tabulation <ul><li>Some rules for the construction of tables: </li></ul><ul><li>1- The table must be self-explanatory. </li></ul><ul><li>2- Title: written at the top of the table to define precisely the content, the place and the time. </li></ul><ul><li>3- Clear headings for the columns and rows, with units of measurement. </li></ul><ul><li>4- The size of the table depends on the number of classes, which usually lies between 2 and 10 rows. The choice depends on the form of the data and the requirements of the distribution. Too few classes may obscure information, and too many will differ little from the raw data. </li></ul>
  9. 9. Types of tables <ul><li>For Qualitative data, draw a simple table eg., List Table : count the number of observations ( frequencies) in each category. </li></ul><ul><li>For Quantitative data, we have to form a frequency distribution Table </li></ul><ul><li>List tables (2 columns- one value for each measured variable) </li></ul><ul><li>Frequency Distribution Tables </li></ul>
  10. 10. Types of tables <ul><li>: List : </li></ul><ul><li>A table consisting of two   columns , the first giving an identification of the observational unit and the second giving the value of variable for that unit. </li></ul><ul><li>Example : number of patients in each hospital department are </li></ul><ul><li>Medicine 100 patients </li></ul><ul><li>Surgery 80 “ </li></ul><ul><li>ENT 28 “ </li></ul><ul><li>Ophthalmology 30 “ </li></ul>
  11. 11. Frequency Distribution tables <ul><li>FDTs are used for presentation of qualitative (and quantitative discrete) data by recording the number of observations in each category. </li></ul><ul><li>These counts are called frequencies. </li></ul><ul><li>No classes or intervals are needed. </li></ul>
  12. 12. <ul><li>FDT for Quantitative Continuous Data consists of a series of classes (intervals) together with the number of observations ( frequency) whose values fall within the interval of each class. </li></ul>Frequency Distribution tables
  13. 13. Frequency Distribution tables <ul><li>EXAMPLE (1) Assume we have a group of 20 individuals whose blood groups were as follows: A, AB, AB, O, B, A, A, B, B, AB, O, AB, AB, A, B, B, B, A, O, A. We want to present these data in a table. </li></ul><ul><li>What type of data is this? </li></ul>
  14. 14. How to Construct a Frequency Distribution tables <ul><li>Four Steps </li></ul><ul><li>Title, Table, No , % </li></ul><ul><li>1-   Put a title </li></ul><ul><li>2-   Draw Columns & Rows </li></ul><ul><li>3-  Enumerate the individuals in each category </li></ul><ul><li>4-  Calculate The relative frequency (%) </li></ul>
  15. 15. How to Construct a Frequency Distribution tables <ul><li>Four Steps </li></ul><ul><li>1-   Put a title eg., </li></ul><ul><li>Distribution of the studied individuals according to their blood group. </li></ul><ul><li>2-   Draw a table (Columns & Rows) , </li></ul><ul><ul><li>First column > Studied Variable “ Blood Group”, </li></ul></ul><ul><ul><li>2 nd column heading >“ Frequency - Number ” </li></ul></ul><ul><ul><li>3 rd column heading > “ Percentage % ” </li></ul></ul>
  16. 16. Frequency Distribution tables <ul><li>3-   Enumerate the individuals in each blood group , i.e. individuals with blood group A are 6 and those with blood group B are 6 , AB are 5 and blood group O are 3. </li></ul><ul><li>Make sure that the total number of individuals in all blood groups is 20 (the number of the studied group). </li></ul>
  17. 17. Frequency Distribution tables <ul><li>4- Calculate the relative frequency (%) of each blood group by dividing the frequency of that group by the total number of individuals and multiplying by 100, </li></ul><ul><li>i.e. the percentage of group A = 6 / 20 x 100 = 30%, and the same for group B; group AB = 5 / 20 x 100 = 25% and group O = 3 / 20 x 100 = 15%. The final table will be : </li></ul>
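The four steps can be sketched in Python for the 20 blood groups listed in Example (1). This is a minimal illustration rather than part of the original slides; the variable names are assumptions made for the sketch.

```python
from collections import Counter

# Blood groups of the 20 individuals from Example (1).
blood_groups = ["A", "AB", "AB", "O", "B", "A", "A", "B", "B", "AB",
                "O", "AB", "AB", "A", "B", "B", "B", "A", "O", "A"]

# Step 3: enumerate the individuals in each category.
frequency = Counter(blood_groups)
total = sum(frequency.values())

# Step 4: relative frequency (%) = frequency / total * 100.
percentage = {group: count / total * 100 for group, count in frequency.items()}

# Steps 1-2: a title and a three-column table.
print("Distribution of the studied individuals according to their blood group")
print(f"{'Blood group':<12}{'Number':>8}{'%':>8}")
for group in ["A", "B", "AB", "O"]:
    print(f"{group:<12}{frequency[group]:>8}{percentage[group]:>8.1f}")
print(f"{'Total':<12}{total:>8}{100.0:>8.1f}")
```

Checking that `total` equals 20 mirrors the advice in step 3 to confirm that no individual was missed or counted twice.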
  18. 18. Frequency Distribution tables What is Your Conclusion?
  19. 19. Frequency Distribution tables <ul><li>We can conclude from this table that blood groups A and B are the most common and group O is the rarest (judging by the percentage of each group). </li></ul><ul><li>So presenting data in a table helps in deducing facts and conveys information more simply than raw data. </li></ul>
  20. 20. Frequency Distribution tables <ul><li>EXAMPLE (3): The following data are systolic blood pressure measurements (mmHg) of 30 patients with hypertension. Present these data in a frequency table: </li></ul><ul><li>150, 155, 160, 154, 162, 170, 165, 155, 190, 186, 180, 178, 195, 200, 180, 156, 173, 188, 173, 189, 190, 177, 186, 177, 174, 155, 164, 163, 172, 160. </li></ul><ul><li>What type of data is this? </li></ul>
  21. 21. Frequency Distribution tables <ul><li>Four Steps </li></ul><ul><li>1-   Put a title eg., </li></ul><ul><li>Frequency distribution of blood pressure measurements (mmHg) among a group of hypertensive patients . </li></ul><ul><li>2-   Draw a table (Columns & Rows) , </li></ul><ul><ul><li>First column > Studied Variable “ Blood Pressure-mm Hg”, </li></ul></ul><ul><ul><li>2 nd column heading >“ Frequency - Number ” </li></ul></ul><ul><ul><li>3 rd column heading > “ Percentage % ” </li></ul></ul>
  22. 22. Frequency Distribution tables <ul><li>3-In the first column we have to classify blood pressure into categories or classes because we have a large sample (N=30) </li></ul><ul><li>and the measured variable is of continuous type (not discrete as in the previous examples). </li></ul>
  23. 23. Frequency Distribution tables construction of classes <ul><li>Calculate the range of the observations: subtract the lowest blood pressure value from the highest (the highest was 200 and the lowest was 150); the difference is 50. </li></ul><ul><li>Determine the number of classes and the width of the class intervals. Let the class interval be 10, so we will have 50/10 = 5 classes. </li></ul><ul><li>Enumerate the frequency by the tally method. </li></ul><ul><li>Calculate the exact frequency and the relative frequency. </li></ul>
  24. 24. Frequency Distribution tables construction of classes <ul><li>Determine the number of classes you want to display (not too few, about 2, and not too many, more than 8); it is a matter of trial and judgment. </li></ul><ul><li>Let the class interval = 10 mmHg; we will have 5 classes. </li></ul><ul><li>If we choose 5 mmHg as the class interval (width), we will obtain 10 classes (too long a table). </li></ul><ul><li>We must maintain a constant width for all intervals. </li></ul><ul><li>Choose the upper and lower limits of the classes, starting with the lowest value, i.e. 150. </li></ul><ul><li>List the intervals in order, every 10 mmHg. </li></ul>
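The construction of classes can be sketched as follows for the 30 systolic readings of the example. This is an illustrative sketch only: placing the highest reading (200) in the last class, so that exactly 5 classes cover the range, is an assumption the slides do not spell out.

```python
# Systolic blood pressure readings (mmHg) of the 30 patients in the example.
readings = [150, 155, 160, 154, 162, 170, 165, 155, 190, 186,
            180, 178, 195, 200, 180, 156, 173, 188, 173, 189,
            190, 177, 186, 177, 174, 155, 164, 163, 172, 160]

lowest, highest = min(readings), max(readings)
value_range = highest - lowest      # 200 - 150 = 50
width = 10                          # class interval chosen in the text
n_classes = value_range // width    # 50 / 10 = 5 classes

# Tally each reading into its class; the top value is clamped into the
# last class (assumption) so no sixth class is needed for 200 alone.
counts = [0] * n_classes
for value in readings:
    index = min((value - lowest) // width, n_classes - 1)
    counts[index] += 1

# Print each class with its frequency and relative frequency (%).
for i, count in enumerate(counts):
    lower = lowest + i * width
    upper = highest if i == n_classes - 1 else lower + width - 1
    print(f"{lower}-{upper}: {count:2d}  ({count / len(readings) * 100:.1f}%)")
```

As with the blood group table, the class frequencies should add up to the sample size (30).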
  25. 26. 2-Graphical Presentation <ul><li>The diagram should be: </li></ul><ul><li>  Simple </li></ul><ul><li>Easy to understand </li></ul><ul><li>Save a lot of words </li></ul><ul><li>Self explanatory </li></ul><ul><li>Has a clear title indicating its content </li></ul><ul><li>Fully labeled </li></ul><ul><li>The y axis (vertical) is usually used for frequency </li></ul>
  26. 27. 2-Graphical Presentation <ul><li>Graphic presentations used to illustrate and clarify information. Tables are essential in presentation of scientific data and diagrams are complementary to summarize these tables in an easy, attractive and simple way. </li></ul>
  27. 28. Graphical Presentation 1- Bar chart <ul><li>It is used for presenting discrete or qualitative data. </li></ul><ul><li>It represents the measured value (or %) by separated rectangles of constant width whose lengths are proportional to the frequency. </li></ul><ul><li>Types: </li></ul><ul><ul><ul><li>>>>Simple , </li></ul></ul></ul><ul><ul><ul><li>>>> Multiple, </li></ul></ul></ul><ul><ul><ul><li>>>>Components </li></ul></ul></ul>
  28. 29. Graphical Presentation 1- Bar chart- Simple
  29. 30. Graphical Presentation 1- Bar chart <ul><li>Multiple bar chart : Each observation has more than one value represented, by a group of bars. Percentage of males and females in different countries, percentage of deaths from heart diseases in old and young age, mode of delivery (cesarean or vaginal) in different female age groups. </li></ul>
  30. 31. Graphical Presentation 1- Bar chart-Multiple <ul><li>Multiple bar chart: </li></ul>[Figure: multiple bar chart comparing cancer and anemia in males and females]
  31. 32. Graphical Presentation 1- Bar chart <ul><li>Component bar chart : subdivision of a single bar to indicate the composition of the total divided into sections according to their relative proportion. </li></ul>
  32. 33. Graphical Presentation 1- Bar chart <ul><li>Component bar chart : </li></ul><ul><li>For example two countries are compared in their socio-economic standard of living, each bar represent one country, the height of the bar is 100, it is divided horizontally into 3 components (low, moderate and high classes) of socio-economic classes (SE), each class is represented by different color or shape. </li></ul>
  33. 34. Graphical Presentation 1- Bar chart- Component
  34. 35. Graphical Presentation 2-Pie diagram: <ul><li>Consist of a circle whose area represents the total frequency (100%) which is divided into segments . </li></ul><ul><li>Each segment represents a proportional composition of the total frequency. </li></ul>
  35. 36. Graphical Presentation 2-Pie diagram:
  36. 37. Graphical Presentation 3- Histogram: <ul><li>It is very similar to the bar chart with the difference that the rectangles or bars are adherent (without gaps). </li></ul><ul><li>It is used for presenting class frequency table (continuous data). </li></ul><ul><li>Each bar represents a class and its height represents the frequency (number of cases), its width represent the class interval. </li></ul>
  37. 38. Graphical Presentation 3- Histogram:
  38. 39. Graphical Presentation 4 -Frequency Polygon <ul><li>Derived from a histogram by connecting the mid points of the tops of the rectangles in the histogram. </li></ul><ul><li>The line connecting the centers of histogram rectangles is called frequency polygon. </li></ul><ul><li>We can draw polygon without rectangles so we will get simpler form of line graph . </li></ul><ul><li>A special type of frequency polygon is the Normal Distribution Curve. </li></ul>
  39. 40. Graphical Presentation 5 - Scatter diagram <ul><li>- It is useful to represent the relationship between two numeric measurements , each observation being represented by a point corresponding to its value on each axis </li></ul>
  40. 41. This scatter diagram showed a positive or direct relationship between NAG and albumin/creatinine among diabetic patients
  41. 42. <ul><li>In negative correlation, the points are scattered in a downward direction, meaning that the relation between the two measurements is inverse, i.e. if one measure increases the other decreases, as shown in the following graph. </li></ul>
  42. 43. Graphical Presentation 6- Line graph: it is diagram showing the relationship between two numeric variables (as the scatter) but the points are joined together to form a line (either broken line or smooth curve)
  43. 44. Normal Distribution Curve
  44. 45. Normal Distribution curve <ul><li>The NDC is a graphical presentation (frequency polygon) of any quantitative biologic variable. </li></ul><ul><li>The Normal Distribution Curve is the frequency polygon of a quantitative variable measured in a large number of individuals. </li></ul><ul><li>It is a form of presentation of the frequency distribution of biologic variables such as weight, height, hemoglobin level and blood pressure, or any continuous data. </li></ul><ul><li>It occupies a major role in the techniques of statistical analysis. </li></ul>
  45. 47. Characteristics of Normal Distribution curve <ul><li>1- It is a bell-shaped, continuous curve. </li></ul><ul><li>2- It is symmetrical, i.e. it can be divided vertically into two equal halves. </li></ul><ul><li>3- The tails never touch the base line but extend to infinity in either direction. </li></ul><ul><li>4- The mean, median and mode values coincide. </li></ul><ul><li>5- It is described by two parameters: the arithmetic mean determines the location of the center of the curve, and the standard deviation represents the scatter around the mean. </li></ul>
  46. 48. Areas under the normal curve <ul><li>X ± 1 SD covers about 68% of the total area under the curve (about 34% on each side of the mean). </li></ul><ul><li>X ± 2 SD covers about 95% of the area. </li></ul><ul><li>X ± 3 SD covers about 99.7% of the area. </li></ul>
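These areas can be checked numerically. The sketch below relies on the standard identity that, for a normal distribution, the fraction of the area within k standard deviations of the mean equals erf(k/√2); the function name `area_within` is an assumption made for the illustration.

```python
import math

def area_within(k):
    """Fraction of the area under a normal curve within mean ± k SD."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"mean ± {k} SD: {area_within(k) * 100:.1f}% of the area")
```

The exact figures are close to, but not exactly, the round percentages usually quoted.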
  47. 49. Skewed data <ul><li>If we represent a collected data by a frequency polygon graph and the resulted curve does not simulate the normal distribution curve (with all its characteristics) </li></ul><ul><li>then these data are not normally distributed </li></ul>
  48. 50. Causes of Skewed Curve Not Normally Distributed Data <ul><li>The curve may be skewed to the right or to the left side </li></ul><ul><li>This is because The data collected are from: </li></ul><ul><li>certain heterogeneous group </li></ul><ul><li>or from diseased or abnormal population </li></ul><ul><li>therefore the results obtained from these data can not be applied or generalized on the whole population. </li></ul>
  49. 51. <ul><li>The NDC can be used to distinguish normal from abnormal measurements. </li></ul><ul><li>Example: </li></ul><ul><li>Suppose we have the NDC of hemoglobin levels for a population of normal adult males, with mean ± SD = 11 ± 1.5. </li></ul><ul><li>We obtain a hemoglobin reading of 8.1 for an individual and want to know whether he is normal or anemic. </li></ul><ul><li>If this reading lies within the central 95% of the area under the curve (i.e. mean ± 2 SD), he is considered normal. If his reading is below that range, he is anemic. </li></ul>
  50. 52. <ul><li>The normal range for hemoglobin in this example will be: </li></ul><ul><li>the higher level of hemoglobin : 11 + 2 ( 1.5 ) =14. </li></ul><ul><li>The lower hemoglobin level 11 – 2 ( 1.5 ) = 8. </li></ul><ul><li>i.e the normal range of hemoglobin of adult males is from 8 to 14. </li></ul><ul><li>our sample (8.1 ) lies within the 95% of his population. </li></ul><ul><li>therefore this individual is normal because his reading lies within the 95% of his population. </li></ul>
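The same check can be written as a short sketch, using the mean and SD from the example. The function names are hypothetical, and the cut-off of 2 SD follows the slides.

```python
def normal_range(mean, sd, k=2):
    """Lower and upper limits of the normal range, mean ± k SD."""
    return mean - k * sd, mean + k * sd

def is_within_normal_range(reading, mean, sd, k=2):
    """True if the reading lies inside mean ± k SD (limits included)."""
    lower, upper = normal_range(mean, sd, k)
    return lower <= reading <= upper

lower, upper = normal_range(11, 1.5)
print(lower, upper)                            # limits of the normal range
print(is_within_normal_range(8.1, 11, 1.5))    # the individual in the example
```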
  51. 53. Data Summarization <ul><li>To summarize data, we need to use one or two parameters that can describe the data. </li></ul><ul><li>Measures of Central tendency which describes the center of the data </li></ul><ul><li>and the Measures of Dispersion , which show how the data are scattered around its center. </li></ul>
  52. 54. Measures of central tendency <ul><li>Variable usually has a point (center) around which the observed values lie. These averages are also called measures of central tendency. The three most commonly used averages are: </li></ul><ul><li>The arithmetic mean: </li></ul><ul><li>The Median </li></ul><ul><li>The Mode </li></ul>
  53. 55. 1- The arithmetic mean: <ul><li>the sum of the observations divided by the number of observations: </li></ul><ul><li>$\bar{x} = \frac{\sum x}{n}$ </li></ul><ul><li>Where: $\bar{x}$ = the mean </li></ul><ul><li>∑ denotes "the sum of" </li></ul><ul><li>x = the values of the observations </li></ul><ul><li>n = the number of observations </li></ul>
  54. 56. <ul><li>Example: In a study the age of 5 students were: 12 , 15, 10, 17, 13 </li></ul><ul><li>Mean = sum of observations / number of observations </li></ul><ul><li>Then the mean X = (12 + 15 + 10 + 17 + 13) / 5 =13.4 years </li></ul>1- The arithmetic mean:
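As a quick check of this calculation, the ages can be averaged directly. A minimal sketch; the variable names are assumptions.

```python
# Ages (years) of the 5 students in the example.
ages = [12, 15, 10, 17, 13]

# Mean = sum of observations / number of observations.
mean_age = sum(ages) / len(ages)
print(f"Mean age = {mean_age:.1f} years")
```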
  55. 57. Calculation of Mean For frequency Distribution Data <ul><li>In case of frequency distribution data we calculate the mean by this equation: </li></ul><ul><li>$\bar{x} = \frac{\sum f x}{n}$ </li></ul><ul><li>where $\bar{x}$ = the mean and f = frequency </li></ul><ul><li>For example: we want to calculate the mean incubation period of this group. </li></ul>
  56. 58. Calculation of Mean For frequency Distribution Data
  57. 59. <ul><li>If data are presented in a frequency table with class intervals, we calculate the mean by the same equation, mean = $\sum f x_1 / n$, where $x_1$ denotes the midpoint of the class interval. </li></ul><ul><li>Example: calculate the mean blood pressure of the following group: </li></ul>Calculation of Mean For frequency Distribution Data with class intervals
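A minimal sketch of the midpoint method follows. The class frequencies used here are an assumption: they are the counts obtained by tallying the 30 systolic readings of the earlier example into classes of width 10 starting at 150, with the midpoint taken halfway between the class limits.

```python
# (lower limit, upper limit, frequency) for each class; assumed values
# derived from the 30 systolic readings of the earlier example.
classes = [(150, 160, 6), (160, 170, 6), (170, 180, 8),
           (180, 190, 6), (190, 200, 4)]

n = sum(f for _, _, f in classes)

# Sum of frequency x class midpoint over all classes.
sum_fx = sum(f * (lower + upper) / 2 for lower, upper, f in classes)

grouped_mean = sum_fx / n
print(f"Mean blood pressure = {grouped_mean:.1f} mmHg")
```

Because every reading in a class is treated as if it sat at the midpoint, the result is an approximation of the mean of the raw readings.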
  58. 62. 2- Median <ul><li>It is the middle observation in a series of observations after arranging them in ascending or descending order. </li></ul><ul><li>The rank of the median is (n + 1)/2 if the number of observations is odd. </li></ul><ul><li>If the number is even, the median is the mean of the two middle observations, those with ranks n/2 and n/2 + 1. </li></ul>
  59. 63. <ul><li>Calculate the median of the following data: 5, 6, 8, 9, 11 (n = 5, odd). </li></ul><ul><li>- The rank of the median = (n + 1) / 2 </li></ul><ul><li>i.e. (5 + 1) / 2 = 3 </li></ul><ul><li>The median is the third value in this group when the data are arranged in ascending (or descending) order. </li></ul><ul><li>- So the median is 8 (the third value). </li></ul>2- Median
  60. 64. <ul><li>- If the number of observations is even, the median is calculated as follows: </li></ul><ul><li>e.g. 5, 6, 8, 9 n = 4 </li></ul><ul><li>- The rank of the median = n / 2, i.e. 4 / 2 = 2, so the median would be the second value of the group. Counting in ascending order the second value is 6, and counting in descending order it is 8; therefore the median is the mean of both observations, i.e. (6 + 8)/2 = 7. </li></ul>2- Median
  61. 65. <ul><li>For simplicity we can apply the same equation used for odd numbers, i.e. (n + 1) / 2. The median rank will be (4 + 1) / 2 = 2½, i.e. the median lies between the second and third values, 6 and 8; take their mean = 7. </li></ul>2- Median
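The rank rule for odd and even numbers of observations can be sketched as a small function. The name `median` and its structure are assumptions made for illustration; Python's standard library offers `statistics.median` for the same purpose.

```python
def median(values):
    """Median by the rank rule: middle value, or mean of the two middle values."""
    ordered = sorted(values)          # arrange in ascending order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: rank (n + 1) / 2, which is zero-based index n // 2.
        return ordered[middle]
    # Even n: mean of the observations with ranks n/2 and n/2 + 1.
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([5, 6, 8, 9, 11]))   # odd number of observations
print(median([5, 6, 8, 9]))       # even number of observations
```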
  62. 66. <ul><li>The most frequent occurring value in the data is the mode and is calculated as follows: </li></ul><ul><li>Example: 5, 6, 7, 5, 10. The mode in this data is 5 since number 5 is repeated twice. Sometimes, there is more than one mode and sometimes there is no mode especially in small set of observations. </li></ul>3- Mode
  63. 67. <ul><li>Example: 20, 18, 14, 20, 13, 14, 30, 19. There are two modes, 14 and 20. </li></ul><ul><li>Example: 300, 280, 130, 125, 240, 270. This set has no mode. </li></ul><ul><li>Unimodal, bimodal, no mode </li></ul>3- Mode
  64. 68. Advantages and disadvantages of the measures of central Tendency: <ul><li>- Mean: it is the preferred measure of central tendency since it takes into account each individual observation, but its main disadvantage is that it is affected by extreme values. </li></ul>
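The three situations (one mode, two modes, no mode) can be reproduced with a short sketch. The helper `modes` is hypothetical; treating a data set in which every value occurs only once as having no mode follows the slides.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s), or an empty list if nothing repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                     # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([5, 6, 7, 5, 10]))                    # one mode
print(modes([20, 18, 14, 20, 13, 14, 30, 19]))    # two modes
print(modes([300, 280, 130, 125, 240, 270]))      # no mode
```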
  65. 69. <ul><li>Median: it is a useful descriptive measure if there are one or two extremely high or low values. </li></ul><ul><li>-Mode : is seldom used. </li></ul>Advantages and disadvantages of the measures of central Tendency:
  66. 70. Measures of Dispersion <ul><li>The measure of dispersion describes the degree of variations or scatter or dispersion of the data around its central values: (dispersion = variation = spread = scatter). </li></ul><ul><li>Range - R </li></ul><ul><li>Variance - V </li></ul><ul><li>Standard Deviation - SD </li></ul><ul><li>Coefficient of Variation - COV </li></ul>
  67. 71.    1- Range: <ul><li>is the difference between the largest and smallest values. </li></ul><ul><li>is the simplest measure of variation. </li></ul><ul><li>disadvantages , it is based only on two of the observations and gives no idea of how the other observations are arranged between these two. </li></ul><ul><li>Also, it tends to be large when the size of the sample increases </li></ul>
  68. 72. <ul><li>If we want the average of the differences between the mean and each observation in the data, </li></ul><ul><li>we have to subtract each value from the mean, </li></ul><ul><li>then sum these differences and divide by the number of observations: $\sum (\text{mean} - x_i) / n$ </li></ul>   2- Variance
  69. 73. <ul><li>$\sum (\text{mean} - x) / n$ </li></ul><ul><li>The value of this expression will always be equal to zero, </li></ul><ul><li>because the differences between each value and the mean have negative and positive signs that cancel out on algebraic summation. </li></ul>   2- Variance
  70. 74. <ul><li>To overcome this zero we square the difference between the mean and each value, so the sign will always be positive. Thus we get: </li></ul><ul><li>$V = \frac{\sum (\text{mean} - x)^2}{n - 1}$ </li></ul>   2- Variance
  71. 75. 3- Standard Deviation SD <ul><li>The main disadvantage of the variance is that it is expressed in the square of the units used. So it is more convenient to express the variation in the original units by taking the square root of the variance. This is called the standard deviation (SD). Therefore $SD = \sqrt{V}$ </li></ul><ul><li>i.e. $SD = \sqrt{\frac{\sum (\text{mean} - x)^2}{n - 1}}$ </li></ul>
  72. 76. <ul><li>The coefficient of variation expresses the standard deviation as a percentage of the sample mean. </li></ul><ul><li>C.V. = SD / mean x 100 </li></ul><ul><li>C.V. is useful when we are interested in the relative size of the variability in the data. </li></ul><ul><li>Example: if we have observations 5, 7, 10, 12 and 16, their mean will be 50/5 = 10. SD = √[(25 + 9 + 0 + 4 + 36) / (5 - 1)] = √(74 / 4) = 4.3 </li></ul><ul><li>C.V. = 4.3 / 10 x 100 = 43% </li></ul>4- Coefficient of variation CoV
  73. 77. Example <ul><li>Calculate the mean, variance, SD and CV from the following measurements: </li></ul><ul><li>5, 7, 10, 12 and 16. </li></ul><ul><li>Mean = (5 + 7 + 10 + 12 + 16) / 5 = 10. </li></ul><ul><li>Variance = (25 + 9 + 0 + 4 + 36) / (5 - 1) = 74 / 4 = 18.5 </li></ul><ul><li>SD = √(74 / 4) = 4.3 </li></ul><ul><li>C.V. = 4.3 / 10 x 100 = 43% </li></ul>
  74. 78. <ul><li>Another set of observations is 2, 2, 5, 10 and 11. Their mean = 30 / 5 = 6 </li></ul><ul><li>SD = √[(16 + 16 + 1 + 16 + 25) / (5 - 1)] = √(74 / 4) = 4.3 </li></ul><ul><li>C.V. = 4.3 / 6 x 100 = 71.7% </li></ul><ul><li>Both sets of observations have the same SD but differ in C.V.: relative to its mean, the first set is more homogeneous (so its C.V. is lower), while the second set is more heterogeneous (so its C.V. is higher). </li></ul>Example
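Both worked examples can be reproduced with one small function. This is a sketch under the definitions given in the slides (variance with n - 1 in the denominator); the function name `describe` is an assumption.

```python
import math

def describe(values):
    """Return (mean, variance, SD, CV%) using the n - 1 denominator."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((mean - x) ** 2 for x in values) / (n - 1)
    sd = math.sqrt(variance)
    cv = sd / mean * 100              # SD as a percentage of the mean
    return mean, variance, sd, cv

first = describe([5, 7, 10, 12, 16])
second = describe([2, 2, 5, 10, 11])

for label, (mean, variance, sd, cv) in (("First", first), ("Second", second)):
    print(f"{label}: mean={mean:.1f} variance={variance:.1f} "
          f"SD={sd:.1f} CV={cv:.1f}%")
```

The two sets share the same sum of squared deviations (74), which is why their SDs coincide while their coefficients of variation do not.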
  75. 79. <ul><li>Example: In a study where age was recorded, the observed values were 6, 8, 9, 7, 6, and the number of observations was 5. </li></ul><ul><li>Calculate the mean, SD, range, mode and median. </li></ul><ul><li>- The mean = sum of observations / their number = 36 / 5 = 7.2 </li></ul>Example
  76. 80. <ul><li>The variance = sum of the squared differences (mean minus observation) / (number of observations - 1): [(7.2 – 6)² + (7.2 – 8)² + (7.2 – 9)² + (7.2 – 7)² + (7.2 – 6)²] / (5 – 1), which is equal to [(1.2)² + (-0.8)² + (-1.8)² + (0.2)² + (1.2)²] / 4 = 6.8 / 4 = 1.7 </li></ul><ul><li>- So the variance = 1.7 </li></ul>Examples
  77. 81. <ul><li>- The S.D. = √1.7 = 1.3 </li></ul><ul><li>- Range = 9 – 6 = 3 </li></ul><ul><li>- The mode is 6 </li></ul><ul><li>- The median: first we have to arrange the data in ascending order, i.e. 6, 6, 7, 8, 9. </li></ul><ul><li>The rank of the median = (n + 1) / 2, i.e. (5 + 1) / 2 = 3, therefore the median is the third value, i.e. median = 7 </li></ul>Examples
  78. 82. Inferential statistics <ul><li>Inference involves making a Generalization about a larger group of individuals on the basis of a subset or sample. </li></ul>
  79. 83. Inferential statistics Hypothesis Testing <ul><li>In hypothesis testing we want to find out whether the observed variation among samples is explained by chance alone (i.e. random sampling variation) or is due to a real difference between groups. </li></ul>
  80. 84. Hypothesis Testing <ul><li>It involves conducting a test of statistical significance quantifying the chance of random sampling variations that may account for observed results. </li></ul><ul><li>In hypotheses testing, we are asking whether the sample mean for example is consistent with a certain hypothesis value for the population mean . </li></ul>
  81. 85. Hypothesis Testing <ul><li>The method of assessing the hypotheses testing is known as significance test . </li></ul><ul><li>The significance testing is a method for assessing whether a result is likely to be due to chance or due to a real effect .   </li></ul>
  82. 86. Hypothesis Testing –Steps <ul><li>>>> Formulate Hypothesis </li></ul><ul><li>>>> Collect the Data </li></ul><ul><li>>>>> Test Your Hypothesis </li></ul><ul><li>>>> Accept or Reject Your Hypothesis </li></ul>
  83. 87. Null and alternative hypotheses <ul><li>In hypothesis testing, specific hypotheses (a null and an alternative hypothesis) are formulated and tested. </li></ul><ul><li>The null hypothesis H0 means: X1=X 2 </li></ul><ul><li>Or X1-X 2=0 </li></ul><ul><li>this means that there is no difference between x1 and x2 </li></ul><ul><li>The alternative hypothesis H1 means </li></ul><ul><li>X1>X2 or X1< X2 </li></ul>
  84. 88. Null and alternative hypotheses <ul><li>The alternative hypothesis H1 means </li></ul><ul><li>X1>X2 or X1< X2 </li></ul><ul><li>this means that there is a difference between x1 and x2. </li></ul><ul><li>If we reject the null hypothesis, i.e. there is a difference between the two readings, then either x1 < x2 or x1> x2; </li></ul><ul><li>in other words the null hypothesis is rejected because x1 is different from x2. </li></ul>
  85. 89. General principles of significance tests <ul><li>set up a null hypothesis and its alternative. </li></ul><ul><li>find the value of the test statistic. </li></ul><ul><li>refer the value of the test statistic to a known distribution which it would follow if the null hypothesis was true. </li></ul>
  86. 90. General principles of significance tests <ul><li>4-conclude that the data are consistent or inconsistent with the null hypothesis. </li></ul><ul><li>If the data are not consistent with the null hypotheses, the difference is said to be statistically significant. If the data are consistent with the null hypotheses it is said that we accept it i.e. statistically insignificant. </li></ul>
  87. 91. General principles of significance tests P<0.05 <ul><li>In medicine, we usually consider that differences are significant if the probability is less than 0.05. This means that if the null hypothesis is true, we shall make a wrong decision less than 5 in a hundred times </li></ul>
  88. 92. Tests of significance <ul><li>The selection of the test of significance depends essentially on the type of data that we have. </li></ul><ul><li>1- Quantitative data (means & SD): t test, paired t test and ANOVA </li></ul><ul><li>2- Qualitative data: Chi-squared test and Z test </li></ul>
  89. 93. Tests of significance <ul><li>Comparison of means: </li></ul><ul><li>1- Comparing two means of large samples using the normal distribution </li></ul><ul><li>(z test or SND, standard normal deviate): </li></ul><ul><li>If we have a large sample (60 or more) that follows a normal distribution, we use the z-test. </li></ul><ul><li>z = (sample mean - population mean) / SE, where SE = SD / √n is the standard error of the mean. If the absolute value of z is greater than 2 (more precisely 1.96), there is a significant difference at the 0.05 level. </li></ul>
  90. 94. Tests of significance <ul><li>The cut-off of 2 is used because the normal range for any biological reading lies within the population mean ± 2 SD (this range includes 95% of the area under the normal distribution curve). </li></ul>
  91. 95. Student’s t-test <ul><li>2- Comparing two means of small samples using the t-test: </li></ul><ul><li>If we have a small sample size (less than 60), we can use the t distribution instead of the normal distribution. </li></ul><ul><li>$t = \frac{\text{mean}_1 - \text{mean}_2}{\sqrt{(SD_1^2 / n_1) + (SD_2^2 / n_2)}}$ </li></ul>
  92. 96. <ul><li>The value of t is compared with the values in the table of the "t distribution" at the corresponding degrees of freedom. If the value of t is less than that in the table, the difference between the samples is insignificant. </li></ul><ul><li>If the t value is larger than that in the table, the difference is significant, i.e. the null hypothesis is rejected. </li></ul>t-test
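The t formula can be sketched as a small function. The summary statistics passed to it below are hypothetical numbers chosen only to make the arithmetic easy to follow; they do not come from the slides, and the resulting t would still have to be compared with the tabulated value at the appropriate degrees of freedom.

```python
import math

def t_statistic(mean1, sd1, n1, mean2, sd2, n2):
    """t = (mean1 - mean2) / sqrt(SD1^2 / n1 + SD2^2 / n2)."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error

# Hypothetical summary statistics for two small samples.
t = t_statistic(mean1=10, sd1=2, n1=8, mean2=8, sd2=2, n2=8)
print(f"t = {t:.2f}")
```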
  94. 98. <ul><li>3-paired t-test: </li></ul><ul><li>If we are comparing repeated observation in the same individual or difference between paired data, we have to use paired t-test where the analysis is carried out using the mean and standard deviation of the difference between each pair. </li></ul>Paired t-test
  95. 99. <ul><li>4-comparing several means: </li></ul><ul><li>Sometimes we need to compare more than two means, this can be done by the use of several t-test which is not only tedious but can lead to spurious significant results. Therefore we have to use what we call analysis of variance or ANOVA. </li></ul>ANOVA
  96. 100. <ul><li>4- Comparing several means: </li></ul><ul><li>There are two main types: one-way analysis of variance and two-way analysis of variance. One-way analysis of variance is appropriate when the subgroups to be compared are defined by just one factor, for example a comparison between the means of different socio-economic classes. Two-way analysis of variance is used when the subdivision is based upon more than one factor. </li></ul>ANOVA
  97. 101. <ul><li>The main idea in the analysis of variance is to take into account the variability both within the groups and between the groups. The value of F is the ratio of the between-groups mean square (MS) to the within-groups mean square. </li></ul><ul><li>F = between-groups MS / within-groups MS </li></ul>ANOVA
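The F ratio for a one-way analysis can be sketched as follows. The slides give only F = between-groups MS / within-groups MS; the sums of squares and degrees of freedom used here (k - 1 between groups, N - k within groups) are the standard one-way definitions, and the function name and data are hypothetical.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F = between-groups mean square / within-groups mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    # Variability of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variability of the observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical data for three groups
f_value = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

A large F means the group means differ by more than the spread within the groups would explain.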
  98. 102. <ul><li>b- Qualitative variables: </li></ul><ul><li>1) Chi-squared test: </li></ul><ul><li>Qualitative data are arranged in a table of rows and columns. The categories of one variable define the rows, and the categories of the other variable define the columns. </li></ul>Chi-Squared Test
  99. 103. <ul><li>A chi-squared test is used to test whether there is an association between the row variable and the column variable or, in other words, whether the distribution of individuals among the categories of one variable is independent of their distribution among the categories of the other. </li></ul><ul><li>X² = Σ (O - E)² / E </li></ul>Chi-Squared Test
  100. 104. <ul><li>1) Chi-squared test: </li></ul><ul><li>degrees of freedom = (rows - 1) × (columns - 1) </li></ul><ul><li>O = observed value in the table </li></ul><ul><li>E = expected value, calculated as follows: E = (Rt × Ct) / GT </li></ul><ul><li>i.e. (row total × column total) / grand total </li></ul>Chi-Squared Test
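The calculation can be sketched in a few lines: expected values from the row and column totals, then the sum of (O - E)² / E over all cells. The function name and the 2 × 2 table are hypothetical and are not the data from the slides.

```python
def chi_squared(table):
    """Return (chi-squared statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # E = row total x column total / grand total
            expected = row_totals[i] * column_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom

# Hypothetical 2 x 2 table of observed counts
chi2, df = chi_squared([[10, 20], [20, 10]])
```

The statistic is then compared with the tabulated value at the returned degrees of freedom.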
  101. 106. <ul><li>From the X² tables, the degrees of freedom are (3 rows - 1) × (3 columns - 1) = 2 × 2 = 4. The critical value at the 0.05 level with d.f. = 4 is 9.49. Therefore we conclude that there is a significant relation between socioeconomic level and the degree of intelligence (because the calculated value of X² is greater than the table value). </li></ul>Chi-Squared Test
  102. 107. <ul><li>2) Z test for comparing two percentages: </li></ul><ul><li>z = (p1 - p2) / √( p1q1/n1 + p2q2/n2 ), where p1 = percentage in the 1st group, p2 = percentage in the 2nd group, q1 = 100 - p1, q2 = 100 - p2, n1 = sample size of group 1, n2 = sample size of group 2. The z test is significant (at the 0.05 level) if the absolute value of the result is greater than 2. </li></ul>Z Test
  103. 108. <ul><li>Example: group 1 includes 50 patients, of whom 5 are anemic, and group 2 includes 60 patients, of whom 20 are anemic. To find out whether groups 1 and 2 differ statistically in the prevalence of anemia, we calculate the z test. </li></ul><ul><li>p1 = 5/50 = 10%, p2 = 20/60 = 33%, q1 = 100 - 10 = 90, q2 = 100 - 33 = 67 </li></ul>Z Test
  104. 109. <ul><li>z = (10 - 33) / √( 10 × 90/50 + 33 × 67/60 ) </li></ul><ul><li>Ignoring the sign: z = 23 / √(18 + 36.85) = 23 / 7.4 = 3.1 </li></ul><ul><li>Therefore there is a statistically significant difference between the percentages of anemia in the studied groups (because z > 2). </li></ul>Z Test
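The anemia example can be reproduced with a short sketch. The function name is hypothetical; the percentages are entered rounded, as in the slides (10% and 33%).

```python
import math

def z_two_percentages(p1, n1, p2, n2):
    """z = (p1 - p2) / sqrt(p1*q1/n1 + p2*q2/n2), with p and q in percent."""
    q1 = 100 - p1
    q2 = 100 - p2
    standard_error = math.sqrt(p1 * q1 / n1 + p2 * q2 / n2)
    return (p1 - p2) / standard_error

# Values from the anemia example: 10% of 50 patients vs 33% of 60 patients
z_value = z_two_percentages(10, 50, 33, 60)
is_significant = abs(z_value) > 2  # rule of thumb for the 0.05 level
```

The sign of `z_value` only reflects which group is listed first; its absolute value is what is compared with 2.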
  105. 110. <ul><li>c- Correlation and regression: </li></ul><ul><li>Correlation measures the closeness of the association between two continuous variables, while linear regression gives the equation of the straight line that best describes the relationship and enables the prediction of one variable from the other. </li></ul>Correlation & regression
  106. 111. <ul><li>1- Correlation: </li></ul><ul><li>In correlation, the closeness of the association is measured by the correlation coefficient, r. The value of r ranges between +1 and -1. </li></ul><ul><li>A value of +1 or -1 means perfect correlation, while 0 means no correlation. If r is near zero, the correlation is weak; if it is near one, the correlation is strong. The sign (- or +) denotes the direction of the correlation. </li></ul>Correlation & regression
  107. 112. <ul><li>1- Correlation: </li></ul><ul><li>A positive correlation means that when one variable increases the other one also increases, while a negative correlation means that when one variable increases the other one decreases. </li></ul>Correlation
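The behaviour of r can be illustrated with a small sketch. The slides do not give a formula, so this assumes the usual Pearson product-moment coefficient; the function name and data are hypothetical.

```python
import math
from statistics import mean

def correlation_coefficient(x, y):
    """Pearson r (assumed formula): sum of cross-products over the root of the
    product of the sums of squares."""
    x_bar, y_bar = mean(x), mean(y)
    sum_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sum_xx = sum((a - x_bar) ** 2 for a in x)
    sum_yy = sum((b - y_bar) ** 2 for b in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

# Hypothetical data: y rises with x in the first case and falls in the second
r_positive = correlation_coefficient([1, 2, 3], [2, 4, 6])
r_negative = correlation_coefficient([1, 2, 3], [6, 4, 2])
```

Here the first pair of variables gives a perfect positive correlation and the second a perfect negative one.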
  108. 113. <ul><li>2- Linear regression: </li></ul><ul><li>Like correlation, linear regression is used to examine the relation between two variables, but it also predicts the change in one variable due to changes in the other. For linear regression, the independent variable has to be distinguished from the dependent variable. </li></ul>Linear regression
  109. 114. <ul><li>2- Linear regression: </li></ul><ul><li>Linear regression not only allows assessment of the presence of an association between the independent and dependent variables but also allows prediction of the dependent variable for a particular value of the independent variable. However, regression should not be used for prediction outside the range of the original data. A t-test is also used to assess the level of significance. The dependent variable in linear regression must be a continuous one. </li></ul>Linear regression
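A minimal sketch of fitting and using a regression line is shown below. It assumes the ordinary least-squares line y = a + b·x, which the slides do not spell out; the function names and data are hypothetical. The `predict` helper refuses to extrapolate, reflecting the warning about the range of the original data.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sum_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sum_xx = sum((a - x_bar) ** 2 for a in x)
    slope = sum_xy / sum_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def predict(x_new, x, y):
    """Predict the dependent variable, but only inside the observed x range."""
    if not min(x) <= x_new <= max(x):
        raise ValueError("x_new lies outside the range of the original data")
    slope, intercept = fit_line(x, y)
    return intercept + slope * x_new

# Hypothetical data: x is the independent variable, y the dependent variable
x_values = [1, 2, 3, 4]
y_values = [3, 5, 7, 9]
slope, intercept = fit_line(x_values, y_values)
y_at_2_5 = predict(2.5, x_values, y_values)
```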
  110. 116. <ul><li>3- Multiple regression: </li></ul><ul><li>Situations frequently occur in which we are interested in how a dependent variable depends on several independent variables, not just one. The test of significance used is the analysis of variance (F test). </li></ul>Multiple regression
  111. 117. <ul><li>How do you select a representative sample of 100 students from a primary school? Use all possible methods of sample selection. </li></ul><ul><li>How do you select one primary school from a rural area and another from an urban area in Egypt? </li></ul>
  112. 118. What Type of Sample Is This? <ul><li>Lottery to select a winner </li></ul><ul><li>Hospitalized patients with SLE </li></ul><ul><li>Every 6th patient coming to an outpatient clinic </li></ul><ul><li>A random 20 females and 20 males out of a group of 100 persons </li></ul><ul><li>All workers in a factory chosen from all factories in a certain governorate </li></ul>
  113. 119. Present the following data by a suitable table & graph <ul><li>Infant mortality rates in 2006 in some countries were as follows: Egypt = 25/1000, USA = 10/1000, Sweden = 12/1000 and Pakistan = 30/1000 </li></ul>
  114. 120. Present the following data by a suitable table & graph <ul><li>The body weights (kg) of a group of male children were as follows: </li></ul><ul><li>12-22-18-17-28-20-16-21-19-16-27-21 kg, and for a group of female children were as follows: </li></ul><ul><li>16-23-19-29-18-22-17-15-21-21-24 kg </li></ul>
  115. 121. <ul><li>The weight (Kg ) of a pregnant </li></ul>