Advanced statistics


Published in: Technology

  2. 2. TOPIC OUTLINE
     PART 1
     - Role of Statistics in Research
     - Descriptive Statistics
     - Hands-On Statistical Software
     - Sample and Population
     - Sampling Procedures
     - Sample Size
     - Hands-On Statistical Software
     - Inferential Statistics
     - Hypothesis Testing
     - Hands-On Statistical Software
  3. 3. TOPIC OUTLINE
     PART 2
     - Choice of Statistical Tests
     - Defining Independent and Dependent Variables
     - Hands-On Statistical Software
     - Scales of Measurements
     - How many Samples / Groups are in the Design
     PART 3
     - Parametric Tests
     - Hands-On Statistical Software
     PART 4
     - Non-Parametric Tests
     - Hands-On Statistical Software
  4. 4. TOPIC OUTLINE
     PART 5
     - Goodness of Fit
     - Hands-On Statistical Software
     PART 6
     - Choosing the Correct Statistical Tests
     - Hands-On Statistical Software
     - Introduction to Multiple and Non-Linear Regression
     - Hands-On Statistical Software
  5. 5. Role of Statistics in Research
     - Normally used to analyze data
     - To organize and make sense of large amounts of data
     - Basic to the intelligent reading of research articles
     - Has made significant contributions in the social sciences, applied sciences, and even business and economics
     - Statistical research makes inferences about population characteristics on the basis of one or more samples that have been studied.
  6. 6. How is Statistics Looked At?
     1. Descriptive - gives us information about, or simply describes, the sample we are studying.
     2. Correlational - enables us to relate variables and establish relationships between and among variables, which are useful in making predictions.
     3. Inferential - goes beyond the sample to make inferences about the population.
  7. 7. Descriptive Statistics
     N - total population/sample size from any given population
     Example: Minutes Spent on the Phone
     102  124  108   86  103   82
      71  104  112  118   87   95
     103  116   85  122   87  100
     105   97  107   67   78  125
     109   99  105   99  101   92
  8. 8. Example 2
     425  430  430  435  435  435  435  435  440  440
     440  440  440  445  445  445  445  445  450  450
     450  450  450  450  450  460  460  460  465  465
     465  470  470  472  475  475  475  480  480  480
     480  485  490  490  490  500  500  500  500  510
     510  515  525  525  525  535  549  550  570  570
     575  575  580  590  600  600  600  600  615  615
  9. 9. Range, Mean, Median and Mode
     The terms mean, median, mode, and range describe properties of statistical distributions. In statistics, a distribution is the set of all possible values for terms that represent defined events. The value of a term, when expressed as a variable, is called a random variable. There are two major types of statistical distributions. The first type has a discrete random variable. This means that every term has a precise, isolated numerical value. An example of a distribution with a discrete random variable is the set of results for a test taken by a class in school. The second major type of distribution has a continuous random variable. In this situation, a term can acquire any value within an unbroken interval or span. Such a distribution is described by a probability density function. This is the sort of function that might, for example, be used by a computer in an attempt to forecast the path of a weather system.
  10. 10. Mean
     The most common expression for the mean of a statistical distribution with a discrete random variable is the mathematical average of all the terms. To calculate it, add up the values of all the terms and then divide by the number of terms. This expression is also called the arithmetic mean. There are other expressions for the mean of a finite set of terms, but these forms are rarely used in statistics. The mean of a statistical distribution with a continuous random variable, also called the expected value, is obtained by integrating the product of the variable with its probability density as defined by the distribution. The expected value is denoted by the lowercase Greek letter mu (µ).
  11. 11. Median
     The median of a distribution with a discrete random variable depends on whether the number of terms in the distribution is even or odd. If the number of terms is odd, then the median is the value of the term in the middle. This is the value such that the number of terms having values greater than or equal to it is the same as the number of terms having values less than or equal to it. If the number of terms is even, then the median is the average of the two terms in the middle, such that the number of terms having values greater than or equal to it is the same as the number of terms having values less than or equal to it. The median of a distribution with a continuous random variable is the value m such that the probability is at least 1/2 (50%) that a randomly chosen point on the function will be less than or equal to m, and the probability is at least 1/2 that a randomly chosen point on the function will be greater than or equal to m.
  12. 12. Mode
     The mode of a distribution with a discrete random variable is the value of the term that occurs the most often. It is not uncommon for a distribution with a discrete random variable to have more than one mode, especially if there are not many terms. This happens when two or more terms occur with equal frequency, and more often than any of the others. A distribution with two modes is called bimodal. A distribution with three modes is called trimodal. The mode of a distribution with a continuous random variable is the value at which the probability density function reaches its maximum. As with discrete distributions, there may be more than one mode.
  13. 13. Range
     The range of a distribution with a discrete random variable is the difference between the maximum value and the minimum value. For a distribution with a continuous random variable, the range is the difference between the two extreme points on the distribution curve, where the value of the function falls to zero. For any value outside the range of a distribution, the value of the function is equal to 0.
     The range is the least reliable of the measures of variability and is used only when one is in a hurry to get a measure of variability.
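     As a quick check of these four definitions, the sketch below applies them to the 30 "Minutes Spent on the Phone" values from the Descriptive Statistics example. It uses only Python's standard `statistics` module; the variable names are illustrative and not part of the original slides.

```python
from statistics import mean, median, multimode

# Minutes spent on the phone: the 30 values from the Descriptive Statistics example
minutes = [
    102, 124, 108, 86, 103, 82,
    71, 104, 112, 118, 87, 95,
    103, 116, 85, 122, 87, 100,
    105, 97, 107, 67, 78, 125,
    109, 99, 105, 99, 101, 92,
]

n = len(minutes)                          # N, the number of terms
avg = mean(minutes)                       # arithmetic mean: sum of the terms / N
mid = median(minutes)                     # N is even, so the two middle terms are averaged
modes = sorted(multimode(minutes))        # every value tied for the highest frequency
data_range = max(minutes) - min(minutes)  # maximum minus minimum

print(n, round(avg, 2), mid, modes, data_range)
```

     Several values in this sample tie for the highest frequency, which illustrates the remark that small data sets often have more than one mode.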
  14. 14. Variance
  15. 15. Variance
  16. 16. Standard Deviation
     The standard deviation formula is very simple: it is the square root of the variance. It is the most commonly used measure of spread.
     An important attribute of the standard deviation as a measure of spread is that if the mean and standard deviation of a normal distribution are known, it is possible to compute the percentile rank associated with any given score.
  17. 17. Standard Deviation
     In a normal distribution, about 68% of the scores are within one standard deviation of the mean and about 95% of the scores are within two standard deviations of the mean.
     The standard deviation has proven to be an extremely useful measure of spread in part because it is mathematically tractable. Many formulas in inferential statistics use the standard deviation.
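     The relationships described on these two slides can be sketched with the standard library. The scores and the normal distribution (mean 100, standard deviation 15) are hypothetical values chosen for illustration, and the population formulas (`pvariance`, `pstdev`) are an assumption, since the transcript does not reproduce the slides' variance formula; `variance` and `stdev` give the sample versions.

```python
from statistics import NormalDist, pstdev, pvariance

scores = [85, 90, 100, 110, 115]   # hypothetical scores, for illustration only

var = pvariance(scores)            # mean squared deviation from the mean
sd = pstdev(scores)                # standard deviation = square root of the variance

# Hypothetical normal distribution with known mean and standard deviation
dist = NormalDist(mu=100, sigma=15)

def percentile_rank(score):
    """Percentage of the normal distribution at or below the given score."""
    return 100 * dist.cdf(score)

within_1_sd = dist.cdf(115) - dist.cdf(85)   # share within one SD of the mean
within_2_sd = dist.cdf(130) - dist.cdf(70)   # share within two SDs of the mean

print(round(sd, 2), round(percentile_rank(115), 1))
print(round(within_1_sd, 3), round(within_2_sd, 3))
```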
  18. 18. Coefficient of Variation
  19. 19. Kurtosis
  20. 20. KURTOSIS - refers to how sharply peaked a distribution is. A value for kurtosis is included with the graphical summary:
     - Values close to 0 indicate normally peaked data.
     - Negative values indicate a distribution that is flatter than normal.
     - Positive values indicate a distribution with a sharper than normal peak.
  21. 21. Skewness
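     Kurtosis and skewness can both be computed from central moments. The sketch below assumes the simple moment-based definitions (with 3 subtracted from kurtosis so that a normal distribution scores 0, matching the interpretation given for kurtosis above). The slides' own formulas are not in the transcript, and statistical packages often apply small-sample corrections, so their values may differ slightly. The data sets are made up.

```python
def central_moment(data, k):
    """k-th central moment: the average of (x - mean) ** k."""
    m = sum(data) / len(data)
    return sum((x - m) ** k for x in data) / len(data)

def skewness(data):
    """Positive for a long right tail, negative for a long left tail."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """About 0 for normally peaked data, negative if flatter, positive if sharper."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]           # hypothetical, evenly spread values
right_skewed = [1, 1, 2, 2, 3, 10]    # hypothetical, one large value on the right

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed))
```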
  22. 22. Samples and Population
     Population - as used in research, refers to all the members of a particular group.
     - It is the group of interest to the researcher.
     - This is the group to whom the researcher would like to generalize the results of a study.
  23. 23.
     - A target population is the actual population to whom the researcher would like to generalize.
     - The accessible population is the population to whom the researcher is entitled to generalize.
  24. 24. SAMPLING
     - Sampling is the process of selecting the individuals who will participate in a research study.
     - A sample is any part of the population of individuals from whom information is obtained.
     - A representative sample is a sample that is similar to the population to whom the researcher is entitled to generalize.
  25. 25. PROBABILITY AND NON-PROBABILITY SAMPLING
     A sampling procedure that gives every element of the population a (known) nonzero chance of being selected in the sample is called probability sampling. Otherwise, the sampling procedure is called non-probability sampling.
     Whenever possible, probability sampling is used because there is no objective way of assessing the reliability of inferences under non-probability sampling.
  26. 26. METHODS OF PROBABILITY SAMPLING
     1. simple random sampling
     2. systematic sampling
     3. stratified sampling
     4. cluster sampling
     5. two-stage random sampling
  27. 27. Simple Random Sampling
     This is a sample selected from a population in such a manner that all members of the population have an equal chance of being selected.
  28. 28. Stratified Random Sampling
     A sample selected so that certain characteristics are represented in the sample in the same proportion as they occur in the population.
  29. 29. Cluster Random Sample
     This is obtained by using groups as the sampling unit rather than individuals.
  30. 30. Two-Stage Random Sample
     Selects groups randomly and then chooses individuals randomly from these groups.
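     The four procedures just defined can be sketched on a toy population. Everything below is hypothetical: the population of 100 people, the "urban"/"rural" strata, the ten clusters, and the function names are assumptions made for illustration, not part of the original slides.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100 people, 10 clusters of 10, 60 urban and 40 rural
population = [
    {"id": i, "cluster": i // 10, "stratum": "urban" if i < 60 else "rural"}
    for i in range(100)
]

def simple_random_sample(pop, n):
    """Every member has an equal chance of being selected."""
    return rng.sample(pop, n)

def stratified_sample(pop, n):
    """Each stratum appears in the sample in the same proportion as in the population."""
    strata = {}
    for person in pop:
        strata.setdefault(person["stratum"], []).append(person)
    sample = []
    for members in strata.values():
        share = round(n * len(members) / len(pop))
        sample.extend(rng.sample(members, share))
    return sample

def cluster_sample(pop, n_clusters):
    """Whole groups, not individuals, are the sampling unit."""
    cluster_ids = sorted({p["cluster"] for p in pop})
    chosen = set(rng.sample(cluster_ids, n_clusters))
    return [p for p in pop if p["cluster"] in chosen]

def two_stage_sample(pop, n_clusters, n_per_cluster):
    """Select groups at random, then individuals at random within each group."""
    cluster_ids = sorted({p["cluster"] for p in pop})
    sample = []
    for cid in rng.sample(cluster_ids, n_clusters):
        members = [p for p in pop if p["cluster"] == cid]
        sample.extend(rng.sample(members, n_per_cluster))
    return sample
```

     For example, `stratified_sample(population, 10)` returns 6 urban and 4 rural members, mirroring the 60/40 split in the population.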
  31. 31. Non-Probability Sampling
     1. accidental or convenience sampling
     2. purposive sampling
     3. quota sampling
     4. snowball or referral sampling
     5. systematic sampling
  32. 32. Systematic Sample
     This is obtained by selecting every nth name in a population.
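     A minimal sketch of selecting every nth name follows. The random starting point within the first interval is an assumption added here (the slide does not say where to start), and the function and variable names are hypothetical.

```python
import random

def systematic_sample(names, n, rng=random):
    """Pick n names by taking every k-th one, where k = len(names) // n.

    Requires 1 <= n <= len(names); the random start is an assumption
    not stated on the slide.
    """
    k = len(names) // n          # sampling interval ("every k-th name")
    start = rng.randrange(k)     # random position within the first interval
    return [names[start + i * k] for i in range(n)]

roster = [f"person_{i:02d}" for i in range(20)]   # hypothetical list of names
picked = systematic_sample(roster, 5, random.Random(7))
print(picked)
```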
  33. 33. Convenience Sampling
     Any group of individuals that is conveniently available to be studied.
  34. 34. Purposive Sampling
     Consists of individuals who have special qualifications of some sort or are deemed representative on the basis of prior evidence.
  35. 35. Quota Sampling
     In quota sampling, the population is first segmented into mutually exclusive sub-groups, just as in stratified sampling. Then judgment is used to select the subjects or units from each segment based on a specified proportion. For example, an interviewer may be told to sample 200 females and 300 males between the ages of 45 and 60. This means that researchers can specify whom they want to sample (targeting).
  36. 36. Snowball Sampling
     Snowball sampling is a technique for developing a research sample where existing study subjects recruit future subjects from among their acquaintances. Thus the sample group appears to grow like a rolling snowball. As the sample builds up, enough data is gathered to be useful for research. This sampling technique is often used in hidden populations which are difficult for researchers to access; example populations would be drug users or prostitutes. As sample members are not selected from a sampling frame, snowball samples are subject to numerous biases.
  37. 37. General Classification of Collecting Data
     1. Census or complete enumeration - the process of gathering information from every unit in the population.
     - not always possible to get timely, accurate and economical data
     - costly, if the number of units in the population is too large
     2. Survey sampling - the process of obtaining information from the units in the selected sample.
     Advantages: reduced cost, greater speed, greater scope, and greater accuracy
  38. 38. Sample size
     Samples should be as large as a researcher can obtain with a reasonable expenditure of time and energy.
     As suggested, the minimum number of subjects is 100 for a descriptive study, 50 for a correlational study, and 30 in each group for experimental and causal-comparative designs.
     According to Padua, for p parameters, the minimum n can be computed as n >= p(p + 3)/2, where p is the number of parameters. For example, if p = 4, the minimum n = 14.
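     The minimum-sample-size rule quoted above is easy to check in code. This sketch only encodes the inequality n >= p(p + 3)/2 as stated on the slide; the function name is hypothetical.

```python
def min_sample_size(p):
    """Smallest n satisfying n >= p(p + 3)/2 for p parameters.

    p(p + 3) is always even, so integer division loses nothing.
    """
    return p * (p + 3) // 2

for p in (1, 2, 4, 10):
    print(p, min_sample_size(p))
```

     With p = 4 this reproduces the slide's worked value of 14.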
  39. 39. Inferential Statistics
     These are formalized techniques used to draw conclusions about populations based on samples taken from the populations.
  40. 40. Hypothesis
     A hypothesis is defined as a tentative theory or supposition provisionally adopted to explain certain facts and to guide in the investigation of others.
     A statistical hypothesis is an assertion or statement that may or may not be true concerning one or more populations.
     Example:
     1. A leading drug in the treatment of hypertension has an advertised therapeutic success rate of 83%. A medical researcher believes he has found a new drug for treating hypertensive patients that has a higher therapeutic success rate than the leading drug, with fewer side effects.
  41. 41. The Statistical Hypothesis:
     H0: The new drug is no better than the old one (p = 0.83)
     H1: The new drug is better than the old one (p > 0.83)
     Example 2. A social researcher is conducting a study to determine if the level of women's participation in community extension programs of the barangay can be affected by their educational attainment, occupation, income, civil status, and age.
  42. 42.
     H0: The level of women's participation in community extension programs is not affected by their educational attainment, occupation, income, civil status and age.
     H1: The level of women's participation in community extension programs is affected by their educational attainment, occupation, income, civil status and age.
     Example 3: A community organizer wants to compare the three community organizing strategies applied to cultural minorities in terms of effectiveness.
  43. 43. A. Hypothesis Testing
     Steps in Hypothesis Testing
     1. Formulate the null hypothesis and the alternative hypothesis
     - These are the statistical hypotheses, which are assumptions or guesses about the populations involved. In short, these are statements about the probability distributions of the populations.
  44. 44. Null Hypothesis
     - This is a hypothesis of "no effect".
     - It is usually formulated for the express purpose of being rejected; that is, it is the negation of the point one is trying to make.
     - This is the hypothesis that two or more variables are not related or that two or more statistics are not significantly different.
  45. 45. Alternative Hypothesis
     - This is the operational statement of the researcher's hypothesis.
     - The hypothesis is derived from the theory of the investigator and generally states a specified relationship between two or more variables, or that two or more statistics differ significantly.
  46. 46. Two Ways of Stating the Alternative Hypothesis
     1. Predictive - specifies the type of relationship existing between two or more variables (direct or indirect) or specifies the direction of the difference between two or more statistics
     2. Non-Predictive - does not specify the type of relationship or the direction of the difference
  47. 47. C. LEVEL OF SIGNIFICANCE (α)
     α is the maximum probability with which we would be willing to risk a Type I error (inappropriately rejecting the null hypothesis).
     A Type I error is the error of rejecting a null hypothesis when it is actually true. Plainly speaking, it occurs when we observe a difference when in truth there is none, thus indicating a test of poor specificity. An example of this would be a test that shows that a woman is pregnant when in reality she is not.
  48. 48.
     In other words, the level of significance determines the risk a researcher would be willing to take in his test.
     The choice of alpha is primarily dependent on the practical application of the result of the study.
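     One way to see what α means is to simulate many studies in which the null hypothesis is true and count how often a test rejects it anyway. The sketch below assumes a right-tailed one-sample Z test with a known population standard deviation of 1; the sample size, number of trials, and seed are arbitrary illustrative choices.

```python
import random
from statistics import NormalDist, mean

def type_i_error_rate(alpha=0.05, n=30, trials=2000, seed=1):
    """Fraction of simulated studies that reject a true null hypothesis."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)    # right-tailed critical value
    rejections = 0
    for _ in range(trials):
        # H0 is true by construction: population mean 0, standard deviation 1
        sample = [rng.gauss(0, 1) for _ in range(n)]
        z = mean(sample) / (1 / n ** 0.5)       # Z statistic with known sigma = 1
        if z > z_crit:
            rejections += 1                     # a Type I error
    return rejections / trials

rate = type_i_error_rate()
print(rate)
```

     The observed rejection rate should land close to the chosen α, which is the sense in which α is the risk of a Type I error that the researcher accepts.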
  49. 49. Examples of α
     - .05 (95% confident of the claim)
     - .01 (99% confident of the claim)
     But take note, α is not always .05 or .01. It can be computed mathematically from a formula in which the variance, the number of samples, and the difference are predetermined (Chebychev's sample size formula).
  50. 50. D. Defining a Region of Rejection
     The region of rejection is a region of the null sampling distribution. It consists of a set of possible values which are so extreme that, when the null hypothesis is true, the probability is small (i.e. equal to alpha) that the sample we observe will yield a value which is among them.
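     For a test statistic that follows the standard normal distribution under the null hypothesis, the boundary of the region of rejection can be found from the inverse of the normal CDF. The sketch below assumes a Z test; the function names are hypothetical.

```python
from statistics import NormalDist

def critical_z(alpha, two_tailed=False):
    """Boundary of the rejection region for a Z test at significance level alpha."""
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)

def in_rejection_region(z_stat, alpha, two_tailed=False):
    """True when the observed statistic is extreme enough to reject H0."""
    z_crit = critical_z(alpha, two_tailed)
    return abs(z_stat) > z_crit if two_tailed else z_stat > z_crit

print(round(critical_z(0.05), 3))                   # right-tailed boundary
print(round(critical_z(0.05, two_tailed=True), 3))  # two-tailed boundary
```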
  51. 51.
     E. Collect the data and compute the value of the test statistic.
     F. State your decision.
     G. State your conclusion.
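     The steps can be walked through for the drug example (H0: p = 0.83, H1: p > 0.83). The trial data below (92 successes among 100 patients) are hypothetical, and the exact binomial test is one reasonable choice of test statistic, not necessarily the one the author had in mind.

```python
from math import comb

def upper_tail(k, n, p):
    """P(X >= k) when X follows a Binomial(n, p) distribution."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

p0 = 0.83                  # success rate under H0 (the leading drug)
alpha = 0.05               # chosen level of significance
n, successes = 100, 92     # hypothetical trial results for the new drug

# Probability of a result at least this favorable if H0 were true
p_value = upper_tail(successes, n, p0)

# Decision: reject H0 when the result falls in the region of rejection
reject_h0 = p_value <= alpha

# Conclusion, stated in terms of the research question
if reject_h0:
    print(f"p-value = {p_value:.4f}: the data support a success rate above 83%.")
else:
    print(f"p-value = {p_value:.4f}: no evidence the new drug is better.")
```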
  52. 52. B. Choose an Appropriate Statistical Test for Testing the Null Hypothesis
     The choice of a statistical test for the analysis of your data requires careful and deliberate judgment.
     PRIMARY CONSIDERATIONS:
     - The choice of a statistical test is dictated by the questions for which the research is designed.
     - The level, the distribution, and the dispersion of data also suggest the type of statistical test to be used.
  53. 53. SECONDARY CONSIDERATIONS
     - The extent of your knowledge in statistics
     - Availability of resources in connection with the computation and interpretation of data
  54. 54. Choice of Statistical Tests
     - This is designed to help you develop a framework for choosing the correct statistic to test your hypothesis.
     - It begins with a set of questions you should ask when selecting your test.
     - It is followed by demonstrations of the factors that are important to consider when choosing your statistic.
  55. 55. Choice of Statistical Tests
     Presented below are four questions you should ask and answer when trying to determine which statistical procedure is most appropriate to test your hypothesis.
  56. 56. Choice of Statistical Tests
     - What are the independent and dependent variables?
     - What is the scale of measurement of the study variables?
     - How many samples/groups are in the design?
     - Have I met the assumptions of the statistical test selected?
  57. 57. Choice of Statistical Tests
     To determine which test should be used in any given circumstance, we need to consider the hypothesis that is being tested, the independent and dependent variables and their scale of measurement, the study design, and the assumptions of the test.
  58. 58. Variables
     Before we can begin to choose our statistical test, we must determine which is the independent and which is the dependent variable in our hypothesis.
     Our dependent variable is always the phenomenon or behavior that we want to explain or predict.
  59. 59. Defining Independent and Dependent Variables
     The independent variable represents a predictor or causal variable in the study.
     In any antecedent-consequent relationship, the antecedent is the independent variable and the consequent is the dependent variable.
  60. 60. Defining Independent and Dependent Variables
     With single samples and one dependent variable, the one-sample Z test, the one-sample t test, and the chi-square goodness-of-fit test are the only statistics that can be used.
     Students sometimes ask, "But don't you have population data too, so you have two sets of data?" Yes and no.
     Data have to exist, or else the population parameters could not be defined. But the researcher does not collect these data; they already exist.
  61. 61. Defining Independent and Dependent Variables
     So, if you are collecting data on one sample and comparing those data to information that has already been gathered and is published, then you are conducting a one-sample test using the one sample/set of data collected in this study.
     For the chi-square goodness-of-fit test, you can also compare the sample against chance probabilities.
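     Comparing a single sample against chance probabilities can be sketched with the usual chi-square goodness-of-fit statistic, the sum of (observed - expected)^2 / expected over the categories. The die-roll counts are made up for illustration, and the function name is hypothetical.

```python
def chi_square_gof(observed, expected_probs):
    """Chi-square goodness-of-fit statistic for observed counts vs. expected probabilities."""
    total = sum(observed)
    expected = [total * p for p in expected_probs]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical data: 120 rolls of a die, compared against chance (1/6 per face)
observed = [18, 22, 20, 20, 20, 20]
chi2 = chi_square_gof(observed, [1 / 6] * 6)
df = len(observed) - 1     # degrees of freedom = number of categories - 1

print(round(chi2, 3), df)
```

     The statistic is then compared with the critical chi-square value for the given degrees of freedom; a value this small gives no reason to doubt that the die is fair.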
  62. 62. Defining Independent and Dependent Variables
     When we have a single sample and independent and dependent variables measured on all subjects, we typically are testing a hypothesis about the association between two variables. The statistics that we have learned to test hypotheses about association include:
     - chi-square test of independence
     - Spearman's rs
     - Pearson's r
     - bivariate regression and multiple regression
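     Two of the association statistics listed above can be sketched in a few lines. Pearson's r is computed from deviations about the means, and Spearman's rs is taken here as Pearson's r applied to the ranks (tied values share the average rank). The paired data are hypothetical.

```python
def pearson_r(x, y):
    """Pearson's r: linear association between two interval/ratio variables."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1          # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rs(x, y):
    """Spearman's rs: Pearson's r computed on the ranks (ordinal association)."""
    return pearson_r(average_ranks(x), average_ranks(y))

hours = [1, 2, 3, 4, 5]          # hypothetical predictor
scores = [1, 4, 9, 16, 25]       # hypothetical outcome, rising but not linearly

print(round(pearson_r(hours, scores), 3), round(spearman_rs(hours, scores), 3))
```

     Because the outcome rises steadily but not in a straight line, rs reaches 1 while r stays slightly below it.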
  63. 63. Multiple Sample Tests
     Studies that refer to repeated measurements or pairs of subjects typically collect at least two sets of scores. Studies that refer to specific subgroups in the population also collect two or more samples of data. Once you have determined that the design uses two or more samples or "groups", then you must determine how many samples or groups are in the design. Studies that are limited to two groups use either the chi-square statistic, Mann-Whitney U, Wilcoxon test, independent means t test, or the dependent means t test.
  64. 64.
     If you have three or more groups in the design, the chi-square statistic, Kruskal-Wallis H Test, Friedman ANOVA for ranks, One-way Between-Groups ANOVA, or Factorial ANOVA may be used, depending on the nature of the relationship between groups. Some of these tests are designed for dependent or correlated samples/groups and some are designed for samples/groups that are completely independent.
  65. 65. Multiple Sample Tests
     Dependent Means
     Dependent groups refer to some type of association or link in the research design between sets of scores. This usually occurs in one of three conditions: repeated measures, linked selection, or matching. Repeated measures designs collect data on subjects using the same measure on at least two occasions. This often occurs before and after a treatment or when the same research subjects are exposed to two different experimental conditions.
  66. 66. Multiple Sample Tests
     When subjects are selected into the study because of natural "links or associations", we want to analyze the data together. This would occur in studies of parent-infant interaction, romantic partners, siblings, or best friends. In a study of parents and their children, a parent's data should be associated with his or her own child's, not some other child's. Subject matching also produces dependent data. Suppose that an investigator wanted to control for socioeconomic differences in research subjects. She might measure socioeconomic status and then match on that variable. The scores on the dependent variable would then be treated as a pair in the statistical test.
  67. 67.
     All statistical procedures for dependent or correlated groups treat the data as linked; therefore it is very important that you correctly identify dependent groups designs. The statistics that can be used for correlated groups are the McNemar Test (two samples or times of measurement), Wilcoxon T Test (two samples), Dependent Means t Test (two samples), Friedman ANOVA for Ranks (three or more samples), and Simple Repeated Measures ANOVA (three or more samples).
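     Treating scores as linked pairs means the test works on the within-pair differences. A minimal sketch of the dependent means t statistic follows, assuming the standard formula t = mean(d) / (sd(d) / sqrt(n)) with n - 1 degrees of freedom. The before/after scores are hypothetical.

```python
from statistics import mean, stdev

def dependent_means_t(first, second):
    """t statistic and degrees of freedom for paired (dependent) scores."""
    diffs = [b - a for a, b in zip(first, second)]   # one difference per linked pair
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / n ** 0.5)      # mean difference / its standard error
    return t, n - 1

# Hypothetical repeated-measures data: the same five subjects before and after a treatment
before = [10, 12, 9, 11, 13]
after = [12, 14, 10, 13, 14]

t_stat, df = dependent_means_t(before, after)
print(round(t_stat, 3), df)
```

     The resulting t is compared with the critical t value for the given degrees of freedom at the chosen level of significance.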
  68. 68. Independent Means
     When there is no subject overlap across groups, we define the groups as independent. Tests of gender differences are a good example of independent groups. We cannot be both male and female at the same time; the groups are completely independent. If you want to determine whether samples are independent or not, ask yourself, "Can a person be in one group at the same time he or she is in another?" If the answer is no (can't be in a remedial education program and a regular classroom at the same time; can't be a freshman in high school and a sophomore in high school at the same time), then the groups are independent.
  69. 69.
     The statistics that can be used for independent groups include the chi-square test of independence (two or more groups), Mann-Whitney U Test (two groups), Independent Means t test (two groups), One-Way Between-Groups ANOVA (three or more groups), and Factorial ANOVA (two or more independent variables).
  70. 70. Scales of Measurements
     Once we have identified the independent and dependent variables, our next step in choosing a statistical test is to identify the scale of measurement of the variables.
     All of the parametric tests that we have learned to date require an interval or ratio scale of measurement for the dependent variable.
  71. 71. Scales of Measurements
     If you are working with a dependent variable that has a nominal or ordinal scale of measurement, then you must choose a nonparametric statistic to test your hypothesis.
  72. 72. How many Samples / Groups are in the Design
     Once you have identified the scale of measurement of the dependent variable, you want to determine how many samples or "groups" are in the study design.
     Designs for which one-sample tests (e.g., Z test; t test; Pearson and Spearman correlations; chi-square goodness-of-fit) are appropriate collect only one set or "sample" of data.
  73. 73. How many Samples / Groups are in the Design
     There must be at least two sets of scores or two "samples" for any statistic that examines differences between groups (e.g., t test for dependent means; t test for independent means; one-way ANOVA; Friedman ANOVA; chi-square test of independence).
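     The selection logic in the preceding slides (scale of measurement, number of groups, dependent or independent groups) can be condensed into a small lookup. This is only a sketch of the pairings named in this deck for tests of differences between groups; the function name, argument names, and the "not covered" fallback are hypothetical, and assumption checks still have to be made separately.

```python
def suggest_test(scale, groups, dependent=False):
    """Suggest a test from the deck's lists.

    scale: 'nominal', 'ordinal', or 'interval' (interval or ratio dependent variable)
    groups: number of samples/groups in the design
    dependent: True for repeated measures, linked, or matched groups
    """
    if groups < 1:
        raise ValueError("a design needs at least one sample")
    if groups == 1:
        table = {
            "nominal": "chi-square goodness-of-fit test",
            "interval": "one-sample Z test or one-sample t test",
        }
    elif groups == 2 and dependent:
        table = {
            "nominal": "McNemar test",
            "ordinal": "Wilcoxon test",
            "interval": "t test for dependent means",
        }
    elif groups == 2:
        table = {
            "nominal": "chi-square test of independence",
            "ordinal": "Mann-Whitney U test",
            "interval": "t test for independent means",
        }
    elif dependent:
        table = {
            "ordinal": "Friedman ANOVA for ranks",
            "interval": "simple repeated measures ANOVA",
        }
    else:
        table = {
            "nominal": "chi-square test of independence",
            "ordinal": "Kruskal-Wallis H test",
            "interval": "one-way between-groups ANOVA",
        }
    return table.get(scale, "not covered in these slides")

print(suggest_test("ordinal", 2))
print(suggest_test("interval", 3, dependent=True))
```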
  74. 74. Parametric Tests
     - Parametric statistics are used when our data are measured on interval or ratio scales of measurement.
     - Tend to need larger samples.
     - Data should fit a particular distribution; otherwise, the data are transformed to fit that distribution.
     - Samples are normally drawn randomly from the population.
     - Follow the assumption of normality, meaning the data are normally distributed.
  75. 75. Parametric Assumptions
     Listed below are the most frequently encountered assumptions for parametric tests. Statistical procedures are available for testing these assumptions.
     The Kolmogorov-Smirnov Test is used to determine how likely it is that a sample came from a population that is normally distributed.
  76. 76. Parametric Assumptions
     The Levene test is used to test the assumption of equal variances.
     If we violate test assumptions, the statistic chosen cannot be applied. In this circumstance we have two options:
     - We can use a data transformation
     - We can choose a nonparametric statistic
     If a data transformation is selected, the transformation must correct the violated assumption. If successful, the transformation is applied and the parametric statistic is used for data analysis.
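     To make the equal-variances check concrete, the sketch below computes the Levene W statistic from absolute deviations about each group's mean. The mean-centered form is an assumption (some software centers on the median instead), the groups are hypothetical, and turning W into a p-value requires the F distribution with (k - 1, N - k) degrees of freedom, which is left to tables or statistical software.

```python
def levene_w(*groups):
    """Levene's W statistic (mean-centered) for two or more groups of scores."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of every score from its own group mean
    z = [[abs(x - sum(g) / len(g)) for x in g] for g in groups]
    z_means = [sum(zi) / len(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within

similar_spread = levene_w([1, 2, 3], [11, 12, 13])     # same spread, different means
different_spread = levene_w([1, 2, 3], [0, 10, 20])    # clearly unequal spread

print(round(similar_spread, 3), round(different_spread, 3))
```

     A W near zero is consistent with equal variances; larger values point toward a violation, in which case a transformation or a nonparametric statistic would be considered.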
  77. 77. Types of Parametric Tests
     - Z test
     - One-Sample t test
     - t test for dependent means
     - t test for independent means
     - One-way ANOVA
     - Factorial ANOVA
     - Pearson's r
     - Bivariate/Multiple regression
  78. 78. Non-Parametric Tests
     - Inference procedures that are generally distribution-free.
     - Nonparametric statistics are used when our data are measured on a nominal or ordinal scale of measurement.
     - Apart from chi-square statistics, which are used for nominal data, nonparametric statistics are appropriate when data are measured on an ordinal scale of measurement.
     - An example is the sign test, which is designed to draw inferences about medians.
  79. 79. Types of Non-parametric Tests
     - Sign Tests
     - Chi-square statistics and their modifications (e.g., McNemar Test) are used for nominal data.
     - Wilcoxon Test - alternative to the t test for dependent means
     - Kruskal-Wallis Test - alternative to one-way between-groups ANOVA
     - Friedman Test - alternative to repeated measures ANOVA
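     The sign test is simple enough to sketch directly. It counts how many observations fall above and below a hypothesized median and asks how surprising that split would be if each side were equally likely. The two-sided exact binomial p-value, the handling of ties (dropped), and the data are assumptions made for this illustration.

```python
from math import comb

def sign_test(data, hypothesized_median):
    """Return (plus count, minus count, two-sided exact p-value)."""
    plus = sum(1 for x in data if x > hypothesized_median)
    minus = sum(1 for x in data if x < hypothesized_median)   # values equal to the median are dropped
    n = plus + minus
    k = min(plus, minus)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return plus, minus, min(1.0, 2 * tail)

ratings = [6, 7, 8, 9, 10]          # hypothetical ordinal ratings
print(sign_test(ratings, 5))        # every rating lies above the hypothesized median
```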
  80. 80. Goodness of Fit Test
  81. 81. Choosing the Correct Statistical Tests
     Summary
     Five issues must be considered when choosing statistical tests:
     - Scale of measurement
     - Number of samples/groups
     - Nature of the relationship between groups
     - Number of variables
     - Assumptions of statistical tests
  82. 82. Introduction to Multiple and Non-Linear Regression
  83. 83. Hands –On Statistical Software
  84. 84. Thank you very much! Hope you are now ready to conduct your study.