What is a variable?
In statistics, a variable has two defining characteristics:
• A variable is an attribute that describes a person, place, thing, or idea.
• The value of the variable can "vary" from one entity to another.
For example, a person's hair color is a potential variable, which could have the value of "blond" for one person and "brunette" for another.
Qualitative vs. Quantitative Variables
Variables can be classified as qualitative (also called categorical; e.g., age group, Likert scale, race) or quantitative (also called numeric).
Examples of types of data
Quantitative
• Continuous: blood pressure, height, weight, age
• Discrete: number of children, number of attacks of asthma per week
Categorical
• Ordinal (ordered categories): grade of breast cancer; better, same, worse; disagree, neutral, agree
• Nominal (unordered categories): sex (male/female); alive or dead; blood group O, A, B, AB
Graphical presentation of data
• is better understood and appreciated by humans.
• brings out the hidden patterns and trends of complex data sets.
Thus the reason for displaying data graphically is twofold:
• Investigators can have a better look at the information collected and the distribution of the data.
• The information can be communicated to others quickly.
We shall discuss in detail some of the commonly used graphical presentations.
Bar Charts
Bar charts are used for qualitative variables. The categories of the variable studied are plotted as bars along the X-axis (horizontal), and the height of each bar equals the percentage or frequency, which is plotted along the Y-axis (vertical).
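Pie Chart
Another interesting method of displaying categorical (qualitative) data is a pie diagram, also called a circular diagram.
The angle of each sector is (X/100) × 360 degrees, where X is the percentage of observations in that category.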
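The percentage-to-angle rule can be sketched in a few lines of Python. The blood-group percentages below are hypothetical and serve only as an illustration; the function name `sector_angles` is likewise an assumption, not part of the slides.

```python
def sector_angles(percentages):
    """Convert category percentages to pie-sector angles in degrees.

    Uses angle = X * 360 / 100, which is the same rule as X/100*360.
    """
    return {category: pct * 360 / 100 for category, pct in percentages.items()}


# Hypothetical percentages, for illustration only (they must total 100).
blood_groups = {"O": 40, "A": 30, "B": 20, "AB": 10}

angles = sector_angles(blood_groups)
for category, angle in angles.items():
    print(f"{category}: {angle:.0f} degrees")
```

Because the percentages add up to 100, the sector angles add up to a full circle of 360 degrees.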
Pie Chart
A pie diagram works best when the number of categories is between 2 and 6. If there are more than 6 categories, try to reduce them by "clubbing" (merging) similar categories, otherwise the diagram becomes too overcrowded.
Stem-and-leaf plots
This presentation is used for quantitative data.
To construct a stem-and-leaf plot, we divide each value into a stem component and a leaf component. The digit in the tens place becomes the stem and the digit in the units place becomes the leaf.
It is very useful for quickly assessing whether the data follow a "normal" distribution, by seeing whether the stem-and-leaf plot shows a bell shape.
For example, consider a sample of 10 values of age in years: 21, 42, 05, 11, 30, 50, 28, 27, 24, 52.
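As a minimal sketch, the construction can be applied to the ten ages above in Python. The helper name `stem_and_leaf` is an assumption made for illustration, and the code assumes non-negative values below 100 so that the tens digit is the stem.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group values by tens digit (stem), keeping units digits as leaves."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # tens digit, units digit
        plot[stem].append(leaf)
    return dict(plot)


# The ten ages from the example (05 is written as 5).
ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]

plot = stem_and_leaf(ages)
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

Each printed row is one stem; the longest row (stem 2, the twenties) marks where the observations are concentrated.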
Histogram
A histogram is used for quantitative continuous data. On the X-axis we plot the exclusive type of class intervals and on the Y-axis we plot the frequencies.
The difference between a bar chart and a histogram is that, since a histogram represents quantitative data measured on a continuous scale, there are no gaps between the bars.
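The frequencies behind a histogram can be sketched as follows, assuming that "exclusive" class intervals means each class includes its lower limit and excludes its upper limit (0-10, 10-20, and so on). The ten ages from the stem-and-leaf example are reused; the function name and the class width of 10 are illustrative choices.

```python
def class_frequencies(values, width):
    """Count values in exclusive class intervals [lower, lower + width)."""
    counts = {}
    for value in values:
        lower = (value // width) * width  # lower limit of the value's class
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]

freq = class_frequencies(ages, 10)
for lower, count in freq.items():
    # A text stand-in for the bars: one '#' per observation.
    print(f"{lower:>2}-{lower + 10:<2} {'#' * count}")
```

A value of exactly 30 falls in the class 30-40, not 20-30, which is what keeps adjacent bars touching without overlap.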
Box-and-Whisker plot
A box-and-whisker plot conveys a great deal of information to the audience in a compact form.
It can be useful for handling many data values.
It allows people to explore data and to draw informal conclusions when two or more variables are present.
It shows only certain summary statistics rather than all the data.
Box-and-Whisker plot
A box-and-whisker plot is the visual representation of the five-number summary.
The five-number summary consists of the median, the quartiles (lower quartile Q1 and upper quartile Q3), and the smallest and greatest values in the distribution.
[Figure: box-and-whisker plot labelled with Minimum, Q1, Median, Q3, Maximum, Range and IQR]
Thus a box-and-whisker plot displays the
• center,
• spread,
• overall range of the distribution.
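A five-number summary can be sketched with Python's standard `statistics` module (Python 3.8 or later). Quartile conventions differ between textbooks and software; this sketch uses the default "exclusive" method of `statistics.quantiles`, so other tools may report slightly different Q1 and Q3. The ten ages from the stem-and-leaf example are reused as data.

```python
import statistics


def five_number_summary(values):
    """Return minimum, Q1, median, Q3 and maximum of the values."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # default: exclusive method
    return {
        "minimum": min(values),
        "Q1": q1,
        "median": median,
        "Q3": q3,
        "maximum": max(values),
    }


ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]

summary = five_number_summary(ages)
for name, value in summary.items():
    print(f"{name}: {value}")
```

These five numbers are exactly what the box (Q1, median, Q3) and the whiskers (minimum, maximum) of the plot are drawn from.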
Scatter Diagram
A scatter diagram gives a quick visual display of the association between two variables, both of which are quantitative (measured on a numerical continuous or numerical discrete scale).
The figure shows at a glance that weight and age are associated: as age increases, weight increases.
Be careful to record the dependent variable along the vertical (Y) axis and the independent variable along the horizontal (X) axis.
Scatter Diagram
In this example weight is dependent on age (as age increases, weight is likely to increase), but age is not dependent on weight (if weight increases, age will not necessarily increase). Thus weight is the dependent variable and has been plotted on the Y axis, while age is the independent variable, plotted along the X axis.
Correlation coefficient
The degree of association is measured by a correlation coefficient, denoted by r. It is sometimes called Pearson's correlation coefficient after its originator and is a measure of linear association.
Correlation coefficient
The correlation coefficient is measured on a scale that varies from +1 through 0 to -1. Complete correlation between two variables is expressed by either +1 or -1.
• When one variable increases as the other increases, the correlation is positive (coffee vs. wakefulness).
• When one decreases as the other increases, it is negative (Old is gold!).
• Complete absence of correlation is represented by 0.
A perfect correlation of ±1 occurs only when the data points all lie exactly on a straight line.
A correlation greater than 0.8 (in absolute value) would be described as strong, whereas a correlation less than 0.5 would be described as weak.
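To make the scale concrete, here is a minimal sketch of Pearson's r computed from its usual definition (the sum of products of deviations divided by the square root of the product of the sums of squared deviations). The age and weight values are hypothetical and chosen only for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, for illustration only.
ages_years = [2, 4, 6, 8, 10]
weights_kg = [12, 16, 21, 25, 32]

r = pearson_r(ages_years, weights_kg)
print(f"r = {r:.3f}")

# Points lying exactly on a straight line give a perfect correlation.
r_up = pearson_r([1, 2, 3], [2, 4, 6])
r_down = pearson_r([1, 2, 3], [6, 4, 2])
```

With these made-up numbers r is well above 0.8, which the rule of thumb above would call a strong positive correlation.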
Correlation coefficient v/s Regression analysis
When the objective is to determine association or the strength of relationship between two such variables, we use the correlation coefficient (r).
If the objective is to quantify and describe the existing relationship with a view to prediction, we use regression analysis.
Regression is used extensively in making predictions based on finding unknown Y values from known X values.
Multiple regression is the same as regression except that it attempts to predict Y from two or more independent X variables.
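The slides do not say how the regression line is fitted. A common choice is ordinary least squares, and the sketch below assumes that method for a single X variable. The data are hypothetical, and `fit_line` is an illustrative name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data, for illustration only.
ages_years = [2, 4, 6, 8, 10]     # independent variable (X)
weights_kg = [12, 16, 21, 25, 32]  # dependent variable (Y)

slope, intercept = fit_line(ages_years, weights_kg)

# Predict an unknown Y (weight) from a known X (age 7 years).
predicted = intercept + slope * 7
print(f"weight = {intercept:.2f} + {slope:.2f} * age")
print(f"predicted weight at age 7: {predicted:.2f} kg")
```

Where correlation returns a single number describing strength, regression returns an equation that can be used for prediction.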
Summarising the Data: Measures of Central Tendency and Variability
Measures of Central Tendency
This gives the centrality measure of the data set, i.e. where the observations are concentrated. There are numerous measures of central tendency: Mean; Median; Mode; Geometric Mean; Harmonic Mean.
Mean (Arithmetic Mean) or Average
This is the most appropriate measure for data following a normal distribution. It is calculated by summing all the observations and then dividing by the number of observations. It is generally denoted by x̄:
x̄ = (x1 + x2 + ... + xn) / n
Mean (Arithmetic Mean) or Average
• It is the simplest of the centrality measures but is influenced by extreme values and hence at times may give fallacious results.
• It depends on all values of the data set but is affected by the fluctuations of sampling.
Example: The serum cholesterol levels (mg/dl) of 10 subjects were found to be as follows:
192 242 203 212 175 284 256 218 182 228
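The mean for this example can be worked out directly from the definition (sum of the observations divided by their number). A short Python sketch, with variable names chosen for illustration:

```python
cholesterol_mg_dl = [192, 242, 203, 212, 175, 284, 256, 218, 182, 228]

total = sum(cholesterol_mg_dl)       # sum of all observations
n = len(cholesterol_mg_dl)           # number of observations
mean_value = total / n

print(f"sum = {total}, n = {n}, mean = {mean_value} mg/dl")
```

The observations add up to 2192, so the mean serum cholesterol is 2192 / 10 = 219.2 mg/dl.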
Median
When the data are skewed, another measure of central tendency called the median is used.
The median is a locative measure: it is the middlemost observation after all the values are arranged in ascending or descending order.
When there is an odd number of observations, there is a single middle value, which is the median.
When there is an even number of observations, there are two middle values, and the median is calculated by taking the mean of these two middle observations.
It is less affected by fluctuations of sampling than the mean.
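The odd and even cases can be sketched as a small function and applied to the serum cholesterol example. The function name `median` is illustrative; Python's `statistics.median` gives the same result.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]                         # single middle value
    return (ordered[middle - 1] + ordered[middle]) / 2  # mean of two middle values


cholesterol_mg_dl = [192, 242, 203, 212, 175, 284, 256, 218, 182, 228]

median_even = median(cholesterol_mg_dl)      # 10 observations: even case
median_odd = median(cholesterol_mg_dl[:9])   # first 9 observations: odd case
print(median_even, median_odd)
```

With all 10 values the two middle observations are 212 and 218, so the median is 215 mg/dl; with only the first 9 values the single middle observation is 212 mg/dl.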
Mode
The mode is the most common value, i.e. the value that repeats itself most often in the data set.
Though the mode is easy to calculate, at times it may be impossible to calculate if no value repeats itself in the data set.
At the other extreme, two or more values may repeat themselves the same number of times. In such cases the distribution is said to be bimodal or multimodal.
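The three situations (one mode, no mode, several modes) can be sketched as follows. Returning an empty list when no value repeats is a design choice made here to match the description above; the small data sets other than the cholesterol values are hypothetical.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once, so there is no mode
    return sorted(value for value, count in counts.items() if count == highest)


cholesterol_mg_dl = [192, 242, 203, 212, 175, 284, 256, 218, 182, 228]

one_mode = modes([1, 2, 2, 3])        # hypothetical data: unimodal
two_modes = modes([1, 2, 2, 3, 3])    # hypothetical data: bimodal
no_mode = modes(cholesterol_mg_dl)    # no value repeats
print(one_mode, two_modes, no_mode)
```

In the serum cholesterol example every value occurs only once, so no mode can be reported for it.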
Measures of Relative Position (Quantiles)
Quantiles are the values that divide a set of numerical data arranged in increasing order into a number of equal parts.
Quartiles divide the numerical data arranged in increasing order into four equal parts of 25% each.
• Thus there are 3 quartiles: Q1, Q2 and Q3.
Deciles are the values that divide the arranged data into ten equal parts of 10% each.
• Thus there are 9 deciles.
Percentiles are the values that divide the arranged data into a hundred equal parts of 1% each.
• Thus there are 99 percentiles.
• Q) Median = ___ percentile, ____ decile and ____ quartile.
Answer
The 50th percentile, 5th decile and 2nd quartile are equal to the median.
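This answer can be checked numerically on the serum cholesterol example with `statistics.quantiles` (Python 3.8 or later), which returns the n - 1 cut points that divide the data into n equal parts. The sketch uses the function's default "exclusive" method; other quantile conventions can give different values for cut points other than the median.

```python
import statistics

cholesterol_mg_dl = [192, 242, 203, 212, 175, 284, 256, 218, 182, 228]

quartiles = statistics.quantiles(cholesterol_mg_dl, n=4)      # 3 cut points
deciles = statistics.quantiles(cholesterol_mg_dl, n=10)       # 9 cut points
percentiles = statistics.quantiles(cholesterol_mg_dl, n=100)  # 99 cut points

second_quartile = quartiles[1]     # Q2
fifth_decile = deciles[4]          # 5th decile
fiftieth_pct = percentiles[49]     # 50th percentile
median_value = statistics.median(cholesterol_mg_dl)

print(second_quartile, fifth_decile, fiftieth_pct, median_value)
```

All four values coincide at 215 mg/dl, and the lengths of the three lists confirm that there are 3 quartiles, 9 deciles and 99 percentiles.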
Measures of Variability
In contrast to measures of central tendency, which describe the center of the data set, measures of variability describe the variability or spread of the observations around the center of the data.
Measures of Variability
Various measures of dispersion are as follows.
• Range
• Interquartile range
• Mean deviation
• Standard deviation
• Coefficient of variation
Range
One of the simplest measures of variability is the range. The range is the difference between the two extremes, i.e. the difference between the maximum and minimum observations.
Range = maximum observation - minimum observation
It gives a rough idea of the dispersion of the data.
The drawback of the range is that it uses only the extreme observations and ignores the rest.
Interquartile Range
Just as the range is the difference between the extreme observations, the interquartile range is calculated by taking the difference between the values of the two extreme quartiles.
Interquartile range = Q3 - Q1
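Both measures can be sketched on the serum cholesterol example. The quartiles here come from `statistics.quantiles` with its default "exclusive" method, which is an assumption: other quartile conventions give somewhat different Q1 and Q3, and therefore a different interquartile range.

```python
import statistics

cholesterol_mg_dl = [192, 242, 203, 212, 175, 284, 256, 218, 182, 228]

# Range = maximum observation - minimum observation
value_range = max(cholesterol_mg_dl) - min(cholesterol_mg_dl)

# Interquartile range = Q3 - Q1
q1, _q2, q3 = statistics.quantiles(cholesterol_mg_dl, n=4)
iqr = q3 - q1

print(f"range = {value_range} mg/dl")
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr} mg/dl")
```

The range (284 - 175 = 109 mg/dl) depends only on the two extreme subjects, whereas the interquartile range describes the spread of the middle half of the data.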
Coefficient of Variation
Besides the measures of variability discussed above, there is one more important measure, called the coefficient of variation, which compares the variability in two data sets.
• It measures variability in relation to the mean (or average) and is used to compare the relative dispersion in one type of data with the relative dispersion in another type of data.
• The data to be compared may be in the same units, in different units, with the same mean, or with different means.
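The slides do not state the formula. The coefficient of variation is commonly computed as the standard deviation divided by the mean, expressed as a percentage, and the sketch below assumes that definition with the sample standard deviation. The height values are hypothetical and are included only to show a comparison across different units.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


cholesterol_mg_dl = [192, 242, 203, 212, 175, 284, 256, 218, 182, 228]
heights_cm = [150, 160, 165, 170, 180]  # hypothetical data, for illustration only

cv_cholesterol = coefficient_of_variation(cholesterol_mg_dl)
cv_height = coefficient_of_variation(heights_cm)

print(f"CV of cholesterol: {cv_cholesterol:.1f}%")
print(f"CV of height:      {cv_height:.1f}%")
```

Because the result is a unit-free percentage, cholesterol in mg/dl can be compared with height in cm; in this illustration the cholesterol values are relatively more variable than the heights.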