# Book001(statweb.blogspot.com)

Published in: Technology, News & Politics

### Book001(statweb.blogspot.com)

Fundamentals of some Basic Statistical Definitions

K. Manoj, M.Sc., M.Phil., D.C.A.

Basic Definitions

#### Statistical Inference

Statistical inference makes use of information from a sample to draw conclusions (inferences) about the population from which the sample was taken.

#### Experiment

An experiment is any process or study which results in the collection of data, the outcome of which is unknown. In statistics, the term is usually restricted to situations in which the researcher has control over some of the conditions under which the experiment takes place.

Example:

Before introducing a new drug treatment to reduce high blood pressure, the manufacturer carries out an experiment to compare the effectiveness of the new drug with that of one currently prescribed. Newly diagnosed subjects are recruited from a group of local general practices. Half of them are chosen at random to receive the new drug, the remainder receiving the present one. So, the researcher has control over the type of subject recruited and the way in which they are allocated to treatment.

#### Experimental (or Sampling) Unit

A unit is a person, animal, plant or thing which is actually studied by a researcher; the basic objects upon which the study or experiment is carried out. For example, a person; a monkey; a sample of soil; a pot of seedlings; a postcode area; a doctor's practice.

#### Sample

A sample is a group of units selected from a larger group (the population). By studying the sample it is hoped to draw valid conclusions about the larger group.

A sample is generally selected for study because the population is too large to study in its entirety. The sample should be representative of the general population. This is often best achieved by random sampling. Also, before collecting the sample, it is important that the researcher carefully and completely defines the population, including a description of the members to be included.

Example:

The population for a study of infant health might be all children born in the UK in the 1980s. The sample might be all babies born on 7th May in any of the years.

#### Parameter

A parameter is a value, usually unknown (and which therefore has to be estimated), used to represent a certain population characteristic. For example, the population mean is a parameter that is often used to indicate the average value of a quantity.

Within a population, a parameter is a fixed value which does not vary. Each sample drawn from the population has its own value of any statistic that is used to estimate this parameter. For example, the mean of the data in a sample is used to give information about the overall mean in the population from which that sample was drawn.

Parameters are often assigned Greek letters (e.g. σ), whereas statistics are assigned Roman letters (e.g. s).
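The distinction between a fixed parameter and a statistic that changes from sample to sample can be illustrated with a small simulation. This is only a sketch: the population of the integers 1 to 1000, the sample size of 50 and the helper name `sample_means` are assumptions made for illustration and do not come from the slides.

```python
import random
import statistics


def sample_means(population, sample_size, n_samples, seed=0):
    """Draw n_samples simple random samples and return the mean of each."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


# Hypothetical population (an assumption for illustration only).
population = list(range(1, 1001))

# The parameter: one fixed value for the whole population.
mu = statistics.mean(population)

# The statistic: one value per sample, varying from sample to sample.
means = sample_means(population, sample_size=50, n_samples=5)

print("population mean (parameter):", mu)
print("sample means (statistics):", means)
```

Each run of `sample_means` with a different seed gives a different set of sample means, while `mu` stays the same.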
#### Statistic

A statistic is a quantity that is calculated from a sample of data. It is used to give information about unknown values in the corresponding population. For example, the average of the data in a sample is used to give information about the overall average in the population from which that sample was drawn.

It is possible to draw more than one sample from the same population, and the value of a statistic will in general vary from sample to sample. For example, the average value in a sample is a statistic. The average values in more than one sample, drawn from the same population, will not necessarily be equal.

Statistics are often assigned Roman letters (e.g. m and s), whereas the equivalent unknown values in the population (parameters) are assigned Greek letters (e.g. µ and σ).

#### Sampling Distribution

The sampling distribution describes probabilities associated with a statistic when a random sample is drawn from a population.

The sampling distribution is the probability distribution or probability density function of the statistic.

Derivation of the sampling distribution is the first step in calculating a confidence interval or carrying out a hypothesis test for a parameter.

Example:

Suppose that $x_1, \ldots, x_n$ are a simple random sample from a normally distributed population with expected value µ and known variance $\sigma^2$. Then the sample mean $\bar{x}$ is a statistic used to give information about the population parameter µ; $\bar{x}$ is normally distributed with expected value µ and variance $\sigma^2/n$.

#### Estimate

An estimate is an indication of the value of an unknown quantity based on observed data.

More formally, an estimate is the particular value of an estimator that is obtained from a particular sample of data and used to indicate the value of a parameter.

Example:

Suppose the manager of a shop wanted to know the mean expenditure of customers in her shop in the last year. She could calculate the average expenditure of the hundreds (or perhaps thousands) of customers who bought goods in her shop, that is, the population mean. Instead she could use an estimate of this population mean by calculating the mean of a representative sample of customers. If this value was found to be £25, then £25 would be her estimate.

#### Estimator

An estimator is any quantity calculated from the sample data which is used to give information about an unknown quantity in the population. For example, the sample mean is an estimator of the population mean.

Estimators of population parameters are sometimes distinguished from the true value by using the symbol "hat". For example,
$\sigma$ = true population standard deviation

$\hat{\sigma}$ = estimated (from a sample) population standard deviation

Example:

The usual estimator of the population mean is

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

where n is the size of the sample and $X_1, X_2, X_3, \ldots, X_n$ are the values of the sample.

If the value of the estimator in a particular sample is found to be 5, then 5 is the estimate of the population mean µ.

#### Estimation

Estimation is the process by which sample data are used to indicate the value of an unknown quantity in a population.

Results of estimation can be expressed as a single value, known as a point estimate, or a range of values, known as a confidence interval.

#### Discrete Data

A set of data is said to be discrete if the values / observations belonging to it are distinct and separate, i.e. they can be counted (1, 2, 3, ...). Examples might include the number of kittens in a litter; the number of patients in a doctor's surgery; the number of flaws in one metre of cloth; gender (male, female); blood group (O, A, B, AB).

Compare continuous data.

#### Categorical Data

A set of data is said to be categorical if the values or observations belonging to it can be sorted according to category. Each value is chosen from a set of non-overlapping categories. For example, shoes in a cupboard can be sorted according to colour: the characteristic colour can have non-overlapping categories black, brown, red and other. People have the characteristic of gender with categories male and female.

Categories should be chosen carefully since a bad choice can prejudice the outcome of an investigation. Every value should belong to one and only one category, and there should be no doubt as to which one.

#### Nominal Data

A set of data is said to be nominal if the values / observations belonging to it can be assigned a code in the form of a number where the numbers are simply labels. You can count but not order or measure nominal data. For example, in a data set males could be coded as 0, females as 1; marital status of an individual could be coded as Y if married, N if single.

#### Ordinal Data

A set of data is said to be ordinal if the values / observations belonging to it can be ranked (put in order) or have a rating scale attached. You can count and order, but not measure, ordinal data.
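The "count" and "order" operations on nominal and ordinal data can be sketched in a few lines. The observations below are hypothetical. Only the coding scheme (0 for male, 1 for female) and the idea of a rating scale come from the text.

```python
from collections import Counter

# Nominal data: the numbers are labels only (0 = male, 1 = female).
sex_codes = [0, 1, 1, 0, 1]            # hypothetical observations
nominal_counts = Counter(sex_codes)    # counting is meaningful

# Ordinal data: ratings on a 1-5 scale can be counted and also ordered.
ratings = [4, 2, 5, 3, 4, 1]           # hypothetical ratings
ordinal_counts = Counter(ratings)
ordered_ratings = sorted(ratings)      # ordering is meaningful

print("nominal counts:", dict(nominal_counts))
print("ordinal counts:", dict(ordinal_counts))
print("ratings in order:", ordered_ratings)
```

The sketch deliberately does not average the nominal codes: a mean of 0/1 labels, or of Y/N codes, would not measure anything about the individuals.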
The categories for an ordinal set of data have a natural order. For example, suppose a group of people were asked to taste varieties of biscuit and classify each biscuit on a rating scale of 1 to 5, representing strongly dislike, dislike, neutral, like, strongly like. A rating of 5 indicates more enjoyment than a rating of 4, for example, so such data are ordinal.

However, the distinction between neighbouring points on the scale is not necessarily always the same. For instance, the difference in enjoyment expressed by giving a rating of 2 rather than 1 might be much less than the difference in enjoyment expressed by giving a rating of 4 rather than 3.

#### Interval Scale

An interval scale is a scale of measurement where the distance between any two adjacent units of measurement (or intervals) is the same but the zero point is arbitrary. Scores on an interval scale can be added and subtracted but cannot be meaningfully multiplied or divided. For example, the time interval between the starts of years 1981 and 1982 is the same as that between 1983 and 1984, namely 365 days. The zero point, year 1 AD, is arbitrary; time did not begin then. Other examples of interval scales include the heights of tides, and the measurement of longitude.

#### Continuous Data

A set of data is said to be continuous if the values / observations belonging to it may take on any value within a finite or infinite interval. You can count, order and measure continuous data. For example height, weight, temperature, the amount of sugar in an orange, the time required to run a mile.

Compare discrete data.

#### Frequency Table

A frequency table is a way of summarising a set of data. It is a record of how often each value (or set of values) of the variable in question occurs. It may be enhanced by the addition of percentages that fall into each category.

A frequency table is used to summarise categorical, nominal, and ordinal data. It may also be used to summarise continuous data once the data set has been divided up into sensible groups.

When we have more than one categorical variable in our data set, a frequency table is sometimes called a contingency table because the figures found in the rows are contingent upon (dependent upon) those found in the columns.

Example:

Suppose that in thirty shots at a target, a marksman makes the following scores:

52234 43203 03215 13155 24004 54455

The frequencies of the different scores can be summarised as:
| Score | Frequency | Frequency (%) |
|-------|-----------|---------------|
| 0     | 4         | 13%           |
| 1     | 3         | 10%           |
| 2     | 5         | 17%           |
| 3     | 5         | 17%           |
| 4     | 6         | 20%           |
| 5     | 7         | 23%           |

#### Pie Chart

A pie chart is a way of summarising a set of categorical data. It is a circle which is divided into segments. Each segment represents a particular category. The area of each segment is proportional to the number of cases in that category.

Example:

Suppose that, in the last year, a sportswear manufacturer has spent 6.5 million pounds on advertising its products: 3 million has been spent on television adverts, 2 million on sponsorship, 1 million on newspaper adverts, and a half million on posters. This spending can be summarised using a pie chart:

[Figure: pie chart of advertising spend by medium; not reproduced in this transcript]

#### Bar Chart

A bar chart is a way of summarising a set of categorical data. It is often used in exploratory data analysis to illustrate the major features of the distribution of the data in a convenient form. It displays the data using a number of rectangles, of the same width, each of which represents a particular category. The length (and hence area) of each rectangle is proportional to the number of cases in the category it represents, for example, age group, religious affiliation.

Bar charts are used to summarise nominal or ordinal data.

Bar charts can be displayed horizontally or vertically and they are usually drawn with a gap between the bars (rectangles), whereas the bars of a histogram are drawn immediately next to each other.
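As a rough illustration, the frequency table for the marksman's thirty scores can be rebuilt in code and drawn as a horizontal text bar chart, with each bar's length proportional to the frequency. The function name `frequency_table` and the use of `#` characters for bars are choices made for this sketch, not something taken from the slides.

```python
from collections import Counter


def frequency_table(values):
    """Return (value, count, percent) rows, sorted by value."""
    counts = Counter(values)
    total = len(values)
    return [
        (value, count, round(100 * count / total))
        for value, count in sorted(counts.items())
    ]


# The thirty scores from the marksman example, one digit per shot.
raw = "52234 43203 03215 13155 24004 54455"
scores = [int(ch) for ch in raw.replace(" ", "")]

table = frequency_table(scores)

print("Score  Freq  Freq(%)  Bar")
for value, count, percent in table:
    # One '#' per shot, so bar length is proportional to frequency.
    print(f"{value:>5}  {count:>4}  {percent:>6}%  {'#' * count}")
```

The percentages are rounded to whole numbers, which is why they need not sum to exactly 100.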
#### Dot Plot

A dot plot is a way of summarising data, often used in exploratory data analysis to illustrate the major features of the distribution of the data in a convenient form.

For nominal or ordinal data, a dot plot is similar to a bar chart, with the bars replaced by a series of dots. Each dot represents a fixed number of individuals. For continuous data, the dot plot is similar to a histogram, with the rectangles replaced by dots.

A dot plot can also help detect any unusual observations (outliers), or any gaps in the data set.

Compare bar chart.

#### Histogram

A histogram is a way of summarising data that are measured on an interval scale (either discrete or continuous). It is often used in exploratory data analysis to illustrate the major features of the distribution of the data in a convenient form. It divides up the range of possible values in a data set into classes or groups. For each group, a rectangle is constructed with a base length equal to the range of values in that specific group, and an area proportional to the number of observations falling into that group. This means that the rectangles might be drawn of non-uniform height.

The histogram is only appropriate for variables whose values are numerical and measured on an interval scale. It is generally used when dealing with large data sets (>100 observations), when stem and leaf plots become tedious to construct. A histogram can also help detect any unusual observations (outliers), or any gaps in the data set.

#### Stem and Leaf Plot

A stem and leaf plot is a way of summarising a set of data measured on an interval scale. It is often used in exploratory data analysis to illustrate the major features of the distribution of the data in a convenient and easily drawn form.

A stem and leaf plot is similar to a histogram but is usually a more informative display for relatively small data sets (<100 data points). It provides a table as well as a picture of the data and from it we can readily write down the data in order of magnitude, which is useful for many statistical procedures, e.g. in the skinfold thickness example below:

[Figure: stem and leaf plot of skinfold thickness data; not reproduced in this transcript]
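The skinfold thickness figure is not available in this transcript, so the following is a minimal sketch of how a stem and leaf plot can be built, using hypothetical two-digit values. It assumes non-negative integers, with the tens digit as the stem and the units digit as the leaf. The function name `stem_and_leaf` is an illustrative choice.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (tens) to its sorted leaves (units).

    Assumes non-negative integers.
    """
    plot = defaultdict(list)
    for value in sorted(values):
        plot[value // 10].append(value % 10)
    return dict(plot)


# Hypothetical measurements (not the skinfold data from the slides).
data = [23, 31, 27, 45, 38, 29, 33, 41, 36, 52, 30, 28]

plot = stem_and_leaf(data)

# Print every stem in range, including empty ones, so gaps stay visible.
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```

Reading the leaves row by row gives the data back in order of magnitude, which is the property the text highlights.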
We can compare more than one data set by the use of multiple stem and leaf plots. By using a back-to-back stem and leaf plot, we are able to compare the same characteristic in two different groups, for example, pulse rate after exercise of smokers and non-smokers.

#### Box and Whisker Plot (or Boxplot)

A box and whisker plot is a way of summarising a set of data measured on an interval scale. It is often used in exploratory data analysis. It is a type of graph which is used to show the shape of the distribution, its central value, and variability. The picture produced consists of the most extreme values in the data set (maximum and minimum values), the lower and upper quartiles, and the median.

A box plot (as it is often called) is especially helpful for indicating whether a distribution is skewed and whether there are any unusual observations (outliers) in the data set.

Box and whisker plots are also very useful when large numbers of observations are involved and when two or more data sets are being compared.

#### 5-Number Summary

A 5-number summary is especially useful when we have so many data that it is sufficient to present a summary of the data rather than the whole data set. It consists of 5 values: the most extreme values in the data set (maximum and minimum values), the lower and upper quartiles, and the median.

A 5-number summary can be represented in a diagram known as a box and whisker plot. In cases where we have more than one data set to analyse, a 5-number summary is constructed for each, with corresponding multiple box and whisker plots.

#### Outlier

An outlier is an observation in a data set which is far removed in value from the others in the data set. It is an unusually large or an unusually small value compared to the others.
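The 5-number summary described above can be computed with Python's standard `statistics` module. This is a sketch under one assumption worth stating: quartile conventions differ between textbooks and software, and the slides do not say which one they use. The "inclusive" method is chosen here arbitrarily. The data values are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum).

    Quartiles use the 'inclusive' method of statistics.quantiles;
    other conventions can give slightly different quartiles.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, statistics.median(values), q3, max(values))


data = [7, 2, 9, 4, 1, 8, 3, 6, 5]  # hypothetical observations
summary = five_number_summary(data)

for label, value in zip(("min", "Q1", "median", "Q3", "max"), summary):
    print(f"{label:>6}: {value}")
```

These five values are exactly what a box and whisker plot draws: the box spans the quartiles, the line inside marks the median, and the whiskers reach to the extremes.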
An outlier might be the result of an error in measurement, in which case it will distort the interpretation of the data, having undue influence on many summary statistics, for example, the mean.

If an outlier is a genuine result, it is important because it might indicate an extreme of behaviour of the process under study. For this reason, all outliers must be examined carefully before embarking on any formal analysis. Outliers should not routinely be removed without further justification.

#### Symmetry

Symmetry is implied when data values are distributed in the same way above and below the middle of the sample.

Symmetrical data sets:

a. are easily interpreted;
b. allow a balanced attitude to outliers, that is, those above and below the middle value (median) can be considered by the same criteria;
c. allow comparisons of spread or dispersion with similar data sets.

Many standard statistical techniques are appropriate only for a symmetric distributional form. For this reason, attempts are often made to transform skewed (asymmetric) data so that they become roughly symmetric.

#### Skewness

Skewness is defined as asymmetry in the distribution of the sample data values. Values on one side of the distribution tend to be further from the middle than values on the other side.

For skewed data, the usual measures of location will give different values; for example, mode < median < mean would indicate positive (or right) skewness.

Positive (or right) skewness is more common than negative (or left) skewness.

If there is evidence of skewness in the data, we can apply transformations, for example, taking logarithms of positively skewed data.

Compare symmetry.

#### Transformation to Normality

If there is evidence of marked non-normality then we may be able to remedy this by applying suitable transformations.

The more commonly used transformations which are appropriate for data which are skewed to the right (positive skew) are, in order of increasing strength, sqrt(x), log(x) and 1/x, where the x's are the data values.

The more commonly used transformations which are appropriate for data which are skewed to the left (negative skew) are, in order of increasing strength, squaring, cubing, and exp(x).
#### Scatter Plot

A scatterplot is a useful summary of a set of bivariate data (two variables), usually drawn before working out a linear correlation coefficient or fitting a regression line. It gives a good visual picture of the relationship between the two variables, and aids the interpretation of the correlation coefficient or regression model.

Each unit contributes one point to the scatterplot, on which points are plotted but not joined. The resulting pattern indicates the type and strength of the relationship between the two variables.

[Figure: example scatterplots; not reproduced in this transcript]

Illustrations:

a. The more the points tend to cluster around a straight line, the stronger the linear relationship between the two variables (the higher the correlation).
b. If the line around which the points tend to cluster runs from lower left to upper right, the relationship between the two variables is positive (direct).
c. If the line around which the points tend to cluster runs from upper left to lower right, the relationship between the two variables is negative (inverse).
d. If there exists a random scatter of points, there is no relationship between the two variables (very low or zero correlation).
e. Very low or zero correlation could result from a non-linear relationship between the variables. If the relationship is in fact non-linear (points clustering around a curve, not a straight line), the correlation coefficient will not be a good measure of the strength.

A scatterplot will also show up a non-linear relationship between the two variables and whether or not there exist any outliers in the data.

More information can be added to a two-dimensional scatterplot; for example, we might label points with a code to indicate the level of a third variable.

If we are dealing with many variables in a data set, a way of presenting all possible scatter plots of two variables at a time is in a scatterplot matrix.

#### Sample Mean

The sample mean is an estimator available for estimating the population mean µ. It is a measure of location, commonly called the average, often symbolised $\bar{x}$.

Its value depends equally on all of the data, which may include outliers. It may not appear representative of the central region for skewed data sets.
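Because every value carries equal weight, a single outlier can pull the sample mean a long way. The sketch below shows this with hypothetical measurements. The function `sample_mean` simply applies the definition (sum of the values divided by their number).

```python
def sample_mean(values):
    """Sum of the data values divided by the number of data values."""
    return sum(values) / len(values)


# Hypothetical measurements clustered around 20.
values = [22, 17, 19, 21, 20]

# The same data with one unusually large value added.
with_outlier = values + [200]

mean_clean = sample_mean(values)
mean_outlier = sample_mean(with_outlier)

print("mean without outlier:", mean_clean)
print("mean with outlier:   ", round(mean_outlier, 2))
```

With the outlier included, the mean is larger than every one of the original five values, so it no longer describes the central region of the data.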
It is especially useful as being representative of the whole sample for use in subsequent calculations.

Example:

Let's say our data set is: 5 3 54 93 83 22 17 19.

The sample mean is calculated by taking the sum of all the data values and dividing by the total number of data values:

$$\bar{x} = \frac{5 + 3 + 54 + 93 + 83 + 22 + 17 + 19}{8} = \frac{296}{8} = 37$$

#### Median

The median is the value halfway through the ordered data set, below and above which there lies an equal number of data values.

It is generally a good descriptive measure of the location which works well for skewed data, or data with outliers.

The median is the 0.5 quantile.

Example:

With an odd number of data values, for example 21, we have:

Data: 96 48 27 72 39 70 7 68 99 36 95 4 6 13 34 74 65 42 28 54 69

Ordered data: 4 6 7 13 27 28 34 36 39 42 48 54 65 68 69 70 72 74 95 96 99

Median: 48, leaving ten values below and ten values above

With an even number of data values, for example 20, we have:

Data: 57 55 85 24 33 49 94 2 8 51 71 30 91 6 47 50 65 43 41 7

Ordered data: 2 6 7 8 24 30 33 41 43 47 49 50 51 55 57 65 71 85 91 94

Median: halfway between the two middle data points, in this case halfway between 47 and 49, and so the median is 48

#### Mode

The mode is the most frequently occurring value in a set of discrete data. There can be more than one mode if two or more values are equally common.

Example:

Suppose the results of an end of term Statistics exam were distributed as follows:

| Student | Score |
|---------|-------|
| 1       | 94    |
| 2       | 81    |
| 3       | 56    |
| 4       | 90    |
| 5       | 70    |
| 6       | 65    |
| 7       | 90    |
| 8       | 90    |
| 9       | 30    |

Then the mode (most common score) is 90, and the median (middle score) is 81.
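The worked examples above can be checked with a short script. The data are the values given in the slides. The hand-written `median` function is a sketch of the "middle of the ordered data" rule, and `statistics.multimode` from the standard library is used for the mode because it returns every value that is tied for most frequent.

```python
import statistics


def median(values):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


mean_data = [5, 3, 54, 93, 83, 22, 17, 19]
odd_data = [96, 48, 27, 72, 39, 70, 7, 68, 99, 36, 95,
            4, 6, 13, 34, 74, 65, 42, 28, 54, 69]
even_data = [57, 55, 85, 24, 33, 49, 94, 2, 8, 51,
             71, 30, 91, 6, 47, 50, 65, 43, 41, 7]
exam_scores = [94, 81, 56, 90, 70, 65, 90, 90, 30]

print("mean of first example:", sum(mean_data) / len(mean_data))
print("median, 21 values:", median(odd_data))
print("median, 20 values:", median(even_data))
print("exam modes:", statistics.multimode(exam_scores))
print("exam median:", median(exam_scores))
```

The results agree with the slides: a mean of 37, a median of 48 for both the 21-value and the 20-value data sets, and a mode of 90 with a median of 81 for the exam scores.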