Prof. JOY V. LORIN-PICARD
DAVAO DEL NORTE STATE COLLEGE
NEW VISAYAS, PANABO CITY
TOPIC OUTLINE
PART 1
Role of Statistics in Research
Descriptive Statistics
Hands-On Statistical Software
Sample and Population
Sampling Procedures
Sample Size
Hands-On Statistical Software
Inferential Statistics
Hypothesis Testing
Hands-On Statistical Software

TOPIC OUTLINE
PART 2
Choice of Statistical Tests
Defining Independent and Dependent Variables
Hands-On Statistical Software
Scales of Measurement
How Many Samples / Groups Are in the Design
PART 3
Parametric Tests
Hands-On Statistical Software
PART 4
Non-Parametric Tests
Hands-On Statistical Software

TOPIC OUTLINE
PART 5
Goodness of Fit
Hands-On Statistical Software
PART 6
Choosing the Correct Statistical Tests
Hands-On Statistical Software
Introduction to Multiple and Non-Linear Regression
Hands-On Statistical Software
Role of Statistics in Research
- Normally used to analyze data
- Organizes and makes sense of large amounts of data
- Basic to the intelligent reading of research articles
- Has made significant contributions in the social sciences, the applied sciences, and even business and economics
- Statistical research makes inferences about population characteristics on the basis of one or more samples that have been studied.
How is statistics looked into?
1. Descriptive – gives us information, or simply describes, the sample we are studying.
2. Correlational – enables us to relate variables and establish relationships between and among variables, which are useful in making predictions.
3. Inferential – goes beyond the sample and makes inferences about the population.
Descriptive Statistics
N – total population / sample size from any given population

Example: Minutes Spent on the Phone
102 124 108  86 103  82
 71 104 112 118  87  95
103 116  85 122  87 100
105  97 107  67  78 125
109  99 105  99 101  92
Range, Mean, Median and Mode
The terms mean, median, mode, and range describe properties of statistical distributions. In statistics, a distribution is the set of all possible values for terms that represent defined events. The value of a term, when expressed as a variable, is called a random variable. There are two major types of statistical distributions. The first type has a discrete random variable: every term has a precise, isolated numerical value. An example of a distribution with a discrete random variable is the set of results for a test taken by a class in school. The second major type has a continuous random variable: a term can acquire any value within an unbroken interval or span. Such a distribution is called a probability density function. This is the sort of function that might, for example, be used by a computer in an attempt to forecast the path of a weather system.
Mean
The most common expression for the mean of a statistical distribution with a discrete random variable is the mathematical average of all the terms. To calculate it, add up the values of all the terms and then divide by the number of terms. This expression is also called the arithmetic mean. There are other expressions for the mean of a finite set of terms, but these forms are rarely used in statistics. The mean of a statistical distribution with a continuous random variable, also called the expected value, is obtained by integrating the product of the variable with its probability as defined by the distribution. The expected value is denoted by the lowercase Greek letter mu (µ).
Median
The median of a distribution with a discrete random variable depends on whether the number of terms in the distribution is even or odd. If the number of terms is odd, then the median is the value of the term in the middle. This is the value such that the number of terms having values greater than or equal to it is the same as the number of terms having values less than or equal to it. If the number of terms is even, then the median is the average of the two terms in the middle, such that the number of terms having values greater than or equal to it is the same as the number of terms having values less than or equal to it. The median of a distribution with a continuous random variable is the value m such that the probability is at least 1/2 (50%) that a randomly chosen point on the function will be less than or equal to m, and the probability is at least 1/2 that a randomly chosen point on the function will be greater than or equal to m.
Mode
The mode of a distribution with a discrete random variable is the value of the term that occurs the most often. It is not uncommon for a distribution with a discrete random variable to have more than one mode, especially if there are not many terms. This happens when two or more terms occur with equal frequency, and more often than any of the others. A distribution with two modes is called bimodal. A distribution with three modes is called trimodal. The mode of a distribution with a continuous random variable is the value at which the function reaches its maximum. As with discrete distributions, there may be more than one mode.
Range
The range of a distribution with a discrete random variable is the difference between the maximum value and the minimum value. For a distribution with a continuous random variable, the range is the difference between the two extreme points on the distribution curve, where the value of the function falls to zero. For any value outside the range of a distribution, the value of the function is equal to 0.
The range is the least reliable of the measures of variability and is used only when one is in a hurry to get a measure of variability.
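The four measures above can be computed directly with Python's standard `statistics` module, using the phone-minutes data from the Descriptive Statistics slide. Note that this particular data set is multimodal (four values each occur twice), so `multimode` is used to get every tied mode:

```python
import statistics

# Minutes spent on the phone (data from the Descriptive Statistics slide)
minutes = [102, 124, 108, 86, 103, 82,
           71, 104, 112, 118, 87, 95,
           103, 116, 85, 122, 87, 100,
           105, 97, 107, 67, 78, 125,
           109, 99, 105, 99, 101, 92]

mean_val = statistics.mean(minutes)       # arithmetic mean of the 30 terms
median_val = statistics.median(minutes)   # average of the 15th and 16th sorted values
modes = statistics.multimode(minutes)     # all most-frequent values (four ties here)
range_val = max(minutes) - min(minutes)   # maximum minus minimum
```

`statistics.mode` would return only one of the tied values, which is why `multimode` is the safer choice for multimodal data like this.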
Standard Deviation
The standard deviation formula is very simple: it is the square root of the variance. It is the most commonly used measure of spread.
An important attribute of the standard deviation as a measure of spread is that if the mean and standard deviation of a normal distribution are known, it is possible to compute the percentile rank associated with any given score.
Standard Deviation
In a normal distribution, about 68% of the scores are within one standard deviation of the mean and about 95% of the scores are within two standard deviations of the mean.
The standard deviation has proven to be an extremely useful measure of spread in part because it is mathematically tractable. Many formulas in inferential statistics use the standard deviation.
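As a sketch of the percentile-rank idea above, the standard library's `NormalDist` converts a score to a percentile once the mean and standard deviation are known, under the assumption that the data are roughly normal. The phone-minutes data are reused, and the 68% rule is checked empirically:

```python
import statistics

minutes = [102, 124, 108, 86, 103, 82, 71, 104, 112, 118, 87, 95,
           103, 116, 85, 122, 87, 100, 105, 97, 107, 67, 78, 125,
           109, 99, 105, 99, 101, 92]

mu = statistics.mean(minutes)
sd = statistics.stdev(minutes)  # sample standard deviation (square root of variance)

# Percentile rank of a score of 124, assuming the scores are normally distributed
pct_rank = statistics.NormalDist(mu, sd).cdf(124) * 100

# Empirical check of the 68% rule: share of scores within one sd of the mean
within_1sd = sum(1 for x in minutes if abs(x - mu) <= sd) / len(minutes)
```

For this sample the empirical share within one standard deviation comes out close to, though not exactly, the 68% that an ideal normal curve would give.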
Coefficient of Variation
The coefficient of variation (CV) is the standard deviation expressed as a percentage of the mean, CV = (s / x̄) × 100%, which makes it possible to compare the spread of variables measured on different scales.
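Only the slide title survives in the source, so as a sketch under the conventional definition (CV = standard deviation over mean, as a percentage), computed on the phone-minutes data:

```python
import statistics

minutes = [102, 124, 108, 86, 103, 82, 71, 104, 112, 118, 87, 95,
           103, 116, 85, 122, 87, 100, 105, 97, 107, 67, 78, 125,
           109, 99, 105, 99, 101, 92]

# Coefficient of variation: sample standard deviation as a percentage of the mean
cv = statistics.stdev(minutes) / statistics.mean(minutes) * 100
```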
KURTOSIS – refers to how sharply peaked a distribution is. A value for kurtosis is included with the graphical summary:
- Values close to 0 indicate normally peaked data.
- Negative values indicate a distribution that is flatter than normal.
- Positive values indicate a distribution with a sharper-than-normal peak.
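A minimal moment-based sketch of (excess) kurtosis, matching the sign convention above: near 0 for a normal peak, negative for flatter-than-normal data, positive for a sharper peak. The two small data sets are made up to illustrate the two non-normal cases:

```python
import statistics

def excess_kurtosis(data):
    """Population excess kurtosis: m4 / m2**2 - 3 (0 for a normal curve)."""
    mu = statistics.fmean(data)
    m2 = sum((x - mu) ** 2 for x in data) / len(data)  # second central moment
    m4 = sum((x - mu) ** 4 for x in data) / len(data)  # fourth central moment
    return m4 / m2 ** 2 - 3

flat = list(range(1, 11))            # evenly spread values -> flatter than normal
peaked = [0] * 8 + [-5, 5]           # heavy center with two stragglers -> sharp peak

g2_flat = excess_kurtosis(flat)      # negative
g2_peaked = excess_kurtosis(peaked)  # positive
```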
Samples and Population
Population – as used in research, refers to all the members of a particular group.
- It is the group of interest to the researcher
- This is the group to whom the researcher would like to generalize the results of a study

A target population is the actual population to whom the researcher would like to generalize.
An accessible population is the population to whom the researcher is entitled to generalize.

SAMPLING
This is the process of selecting the individuals who will participate in a research study.
A sample is any part of the population of individuals from whom information is obtained.
A representative sample is a sample that is similar to the population to whom the researcher is entitled to generalize.
PROBABILITY AND NON-PROBABILITY SAMPLING
A sampling procedure that gives every element of the population a (known) nonzero chance of being selected in the sample is called probability sampling. Otherwise, the sampling procedure is called non-probability sampling.
Whenever possible, probability sampling is used, because there is no objective way of assessing the reliability of inferences under non-probability sampling.
METHODS OF PROBABILITY SAMPLING
1. simple random sampling
2. systematic sampling
3. stratified sampling
4. cluster sampling
5. two-stage random sampling
Simple Random Sampling
This is a sample selected from a population in such a manner that all members of the population have an equal chance of being selected.

Stratified Random Sampling
A sample selected so that certain characteristics are represented in the sample in the same proportion as they occur in the population.

Cluster Random Sample
This is obtained by using groups as the sampling unit rather than individuals.

Two-Stage Random Sample
Selects groups randomly and then chooses individuals randomly from these groups.
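The four probability designs above can be sketched with the standard `random` module. The population, the strata, and all sample sizes below are made up purely for illustration:

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

population = [f"person_{i}" for i in range(100)]  # hypothetical sampling frame

# 1. Simple random sampling: every member has an equal chance of selection
srs = random.sample(population, 10)

# 2. Systematic sampling: a random start, then every nth (here n = 10) name
start = random.randrange(10)
systematic = population[start::10]

# 3. Stratified sampling: draw from each stratum in proportion to its size
strata = {"urban": population[:60], "rural": population[60:]}
stratified = []
for members in strata.values():
    k = round(len(members) * 0.10)       # a 10% sample from each stratum
    stratified.extend(random.sample(members, k))

# 4. Cluster / two-stage: pick whole groups at random, then members within them
clusters = [population[i:i + 20] for i in range(0, 100, 20)]
chosen_clusters = random.sample(clusters, 2)      # first stage: clusters
two_stage = [m for c in chosen_clusters
             for m in random.sample(c, 5)]        # second stage: individuals
```

Cluster sampling alone would keep every member of the chosen clusters; adding the second random draw within clusters is what makes the last design two-stage.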
Non-Probability Sampling
1. accidental or convenience sampling
2. purposive sampling
3. quota sampling
4. snowball or referral sampling
5. systematic sampling

Systematic Sample
This is obtained by selecting every nth name in a population.

Convenience Sampling
Any group of individuals that is conveniently available to be studied.

Purposive Sampling
Consists of individuals who have special qualifications of some sort or are deemed representative on the basis of prior evidence.
Quota Sampling
In quota sampling, the population is first segmented into mutually exclusive sub-groups, just as in stratified sampling. Then judgment is used to select the subjects or units from each segment based on a specified proportion. For example, an interviewer may be told to sample 200 females and 300 males between the ages of 45 and 60. This means that individuals can put a demand on who they want to sample (targeting).
Snowball Sampling
Snowball sampling is a technique for developing a research sample where existing study subjects recruit future subjects from among their acquaintances. Thus the sample group appears to grow like a rolling snowball. As the sample builds up, enough data are gathered to be useful for research. This sampling technique is often used in hidden populations which are difficult for researchers to access; example populations would be drug users or prostitutes. As sample members are not selected from a sampling frame, snowball samples are subject to numerous biases.
General Classification of Collecting Data
1. Census or complete enumeration – the process of gathering information from every unit in the population.
- not always possible to get timely, accurate and economical data
- costly, if the number of units in the population is too large
2. Survey sampling – the process of obtaining information from the units in the selected sample.
Advantages: reduced cost, greater speed, greater scope, and greater accuracy.
Sample Size
Samples should be as large as a researcher can obtain with a reasonable expenditure of time and energy.
As suggested, the minimum number of subjects is 100 for a descriptive study, 50 for a correlational study, and 30 in each group for experimental and causal-comparative designs.
According to Padua, for p parameters, the minimum n can be computed as n ≥ (p + 3)p/2; say if p = 4, then the minimum n = 14.
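Padua's rule of thumb above can be checked in a couple of lines (the helper name is mine, not from the source):

```python
import math

def padua_min_n(p):
    """Minimum sample size for p parameters: n >= (p + 3) * p / 2."""
    return math.ceil((p + 3) * p / 2)

# With p = 4 parameters, the slide's figure of 14 falls out directly:
n_min = padua_min_n(4)  # (4 + 3) * 4 / 2 = 14
```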
Inferential Statistics
This is a formalized set of techniques used to make conclusions about populations based on samples taken from the populations.
Hypothesis
A hypothesis is defined as the tentative theory or supposition provisionally adopted to explain certain facts and to guide in the investigation of others.
A statistical hypothesis is an assertion or statement that may or may not be true concerning one or more populations.
Example:
1. A leading drug in the treatment of hypertension has an advertised therapeutic success rate of 83%. A medical researcher believes he has found a new drug for treating hypertensive patients that has a higher therapeutic success rate than the leading drug, with fewer side effects.
The Statistical Hypothesis:
HO: The new drug is no better than the old one (p = 0.83)
H1: The new drug is better than the old one (p > 0.83)

Example 2. A social researcher is conducting a study to determine if the level of women's participation in community extension programs of the barangay can be affected by their educational attainment, occupation, income, civil status, and age.

HO: The level of women's participation in community extension programs is not affected by their educational attainment, occupation, income, civil status and age.
H1: The level of women's participation in community extension programs is affected by their educational attainment, occupation, income, civil status and age.

Example 3: A community organizer wants to compare three community organizing strategies applied to cultural minorities in terms of effectiveness.
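For Example 1 (the hypertension drug), a one-sided exact binomial test sketches how H0: p = 0.83 versus H1: p > 0.83 would be evaluated. The trial counts below (90 successes out of 100 patients) are hypothetical; the source states only the hypotheses:

```python
from math import comb

def binom_pvalue_greater(k, n, p0):
    """Exact one-sided p-value: P(X >= k) when the true success rate is p0."""
    return sum(comb(n, i) * p0 ** i * (1 - p0) ** (n - i)
               for i in range(k, n + 1))

# Hypothetical trial: 90 of 100 patients respond to the new drug
p_value = binom_pvalue_greater(90, 100, 0.83)

# Compare against the chosen level of significance (see the α discussion below)
reject_h0 = p_value < 0.05
```

The larger the observed success count relative to the 83 expected under H0, the smaller the p-value, and the stronger the evidence for the new drug.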
A. Hypothesis Testing
Steps in Hypothesis Testing
1. Formulate the null hypothesis and the alternative hypothesis
– these are the statistical hypotheses, which are assumptions or guesses about the populations involved. In short, they are statements about the probability distributions of the populations.
Null Hypothesis
This is a hypothesis of "no effect".
It is usually formulated for the express purpose of being rejected; that is, it is the negation of the point one is trying to make.
This is the hypothesis that two or more variables are not related or that two or more statistics are not significantly different.

Alternative Hypothesis
This is the operational statement of the researcher's hypothesis.
It is the hypothesis derived from the theory of the investigator, and it generally states a specified relationship between two or more variables or that two or more statistics significantly differ.
Two Ways of Stating the Alternative Hypothesis
1. Predictive – specifies the type of relationship existing between two or more variables (direct or indirect) or specifies the direction of the difference between two or more statistics
2. Non-Predictive – does not specify the type of relationship or the direction of the difference
C. LEVEL OF SIGNIFICANCE (α)
α is the maximum probability with which we would be willing to risk a Type I error (the hypothesis can be inappropriately rejected): the error of rejecting a null hypothesis when it is actually true. Plainly speaking, it occurs when we observe a difference when in truth there is none, thus indicating a test of poor specificity. An example of this would be a test showing that a woman is pregnant when in reality she is not.

In other words, the level of significance determines the risk a researcher would be willing to take in his test. The choice of alpha is primarily dependent on the practical application of the result of the study.

Examples of α
.05 (95% confident of the claim)
.01 (99% confident of the claim)
But take note: α is not always .05 or .01. It can also be computed mathematically from Chebyshev's sample-size formula, in which the variance, the number of samples, and the tolerable difference are predetermined.
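The slide's formula itself did not survive extraction, so the form used below is an assumption on my part: one common Chebyshev-based sample-size expression is n ≥ s² / (α · e²), where e is the tolerable difference between the sample and population means:

```python
import math

def chebyshev_n(variance, alpha, e):
    """Sample size via Chebyshev's inequality: P(|X̄ - μ| >= e) <= s²/(n·e²) = α.
    NOTE: this particular form is an assumption; the slide's own formula
    was not reproduced in the source."""
    return math.ceil(variance / (alpha * e ** 2))

# e.g. variance 100, tolerable difference 2, risk α = .05
n_needed = chebyshev_n(100, 0.05, 2)  # 100 / (0.05 * 4) = 500
```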
D. Defining a Region of Rejection
The region of rejection is a region of the null sampling distribution. It consists of a set of possible values which are so extreme that, when the null hypothesis is true, the probability is small (i.e., equal to alpha) that the sample we observe will yield a value which is among them.
E. Collect the data.
F. Compute the value of the test statistic.
G. State your decision.
H. State your conclusion.
B. Choose an Appropriate Statistical Test for Testing the Null Hypothesis
The choice of a statistical test for the analysis of your data requires careful and deliberate judgment.
PRIMARY CONSIDERATIONS:
- The choice of a statistical test is dictated by the questions for which the research is designed
- The level, the distribution, and the dispersion of the data also suggest the type of statistical test to be used

SECONDARY CONSIDERATIONS:
- The extent of your knowledge in statistics
- Availability of resources in connection with the computation and interpretation of data
Choice of Statistical Tests
This is designed to help you develop a framework for choosing the correct statistic to test your hypothesis. It begins with a set of questions you should ask when selecting your test. It is followed by demonstrations of the factors that are important to consider when choosing your statistic.

Choice of Statistical Tests
Presented below are four questions you should ask and answer when trying to determine which statistical procedure is most appropriate to test your hypothesis.

Choice of Statistical Tests
- What are the independent and dependent variables?
- What is the scale of measurement of the study variables?
- How many samples/groups are in the design?
- Have I met the assumptions of the statistical test selected?

Choice of Statistical Tests
To determine which test should be used in any given circumstance, we need to consider the hypothesis that is being tested, the independent and dependent variables and their scale of measurement, the study design, and the assumptions of the test.
Variables
Before we can begin to choose our statistical test, we must determine which is the independent and which is the dependent variable in our hypothesis. Our dependent variable is always the phenomenon or behavior that we want to explain or predict.

Defining Independent and Dependent Variables
The independent variable represents a predictor or causal variable in the study. In any antecedent-consequent relationship, the antecedent is the independent variable and the consequent is the dependent variable.
Defining Independent and Dependent Variables
With single samples and one dependent variable, the one-sample Z test, the one-sample t test, and the chi-square goodness-of-fit test are the only statistics that can be used. Students sometimes ask, "but don't you have population data too, so you have two sets of data?" Yes and no. Data have to exist for the population parameters to be defined. But the researcher does not collect these data; they already exist.
Defining Independent and Dependent Variables
So, if you are collecting data on one sample and comparing those data to information that has already been gathered and is published, then you are conducting a one-sample test using the one sample/set of data collected in this study. For the chi-square goodness-of-fit test, you can also compare the sample against chance probabilities.
Defining Independent and Dependent Variables
When we have a single sample and independent and dependent variables measured on all subjects, we typically are testing a hypothesis about the association between two variables. The statistics that we have learned to test hypotheses about association include:
- chi-square test of independence
- Spearman's rs
- Pearson's r
- bivariate regression and multiple regression
Multiple Sample Tests
Studies that refer to repeated measurements or pairs of subjects typically collect at least two sets of scores. Studies that refer to specific subgroups in the population also collect two or more samples of data. Once you have determined that the design uses two or more samples or "groups", then you must determine how many samples or groups are in the design. Studies that are limited to two groups use either the chi-square statistic, Mann-Whitney U, Wilcoxon test, independent means t test, or the dependent means t test.
If you have three or more groups in the design, use the chi-square statistic, Kruskal-Wallis H test, Friedman ANOVA for ranks, one-way between-groups ANOVA, or factorial ANOVA, depending on the nature of the relationship between groups. Some of these tests are designed for dependent or correlated samples/groups and some are designed for samples/groups that are completely independent.
Multiple Sample Tests
Dependent Means
Dependent groups refer to some type of association or link in the research design between sets of scores. This usually occurs in one of three conditions: repeated measures, linked selection, or matching. Repeated measures designs collect data on subjects using the same measure on at least two occasions. This often occurs before and after a treatment or when the same research subjects are exposed to two different experimental conditions.
Multiple Sample Tests
When subjects are selected into the study because of natural "links or associations", we want to analyze the data together. This would occur in studies of parent-infant interaction, romantic partners, siblings, or best friends. In a study of parents and their children, a parent's data should be associated with his son's, not some other child's. Subject matching also produces dependent data. Suppose that an investigator wanted to control for socioeconomic differences in research subjects. She might measure socioeconomic status and then match on that variable. The scores on the dependent variable would then be treated as a pair in the statistical test.
All statistical procedures for dependent or correlated groups treat the data as linked; therefore it is very important that you correctly identify dependent-groups designs. The statistics that can be used for correlated groups are the McNemar test (two samples or times of measurement), Wilcoxon t test (two samples), dependent means t test (two samples), Friedman ANOVA for ranks (three or more samples), and simple repeated measures ANOVA (three or more samples).
Independent Means
When there is no subject overlap across groups, we define the groups as independent. Tests of gender differences are a good example of independent groups. We cannot be both male and female at the same time; the groups are completely independent. If you want to determine whether samples are independent or not, ask yourself, "Can a person be in one group at the same time he or she is in another?" If the answer is no (you can't be in a remedial education program and a regular classroom at the same time; you can't be a freshman in high school and a sophomore in high school at the same time), then the groups are independent.
The statistics that can be used for independent groups include the chi-square test of independence (two or more groups), Mann-Whitney U test (two groups), independent means t test (two groups), one-way between-groups ANOVA (three or more groups), and factorial ANOVA (two or more independent variables).
Scales of Measurement
Once we have identified the independent and dependent variables, our next step in choosing a statistical test is to identify the scale of measurement of the variables. All of the parametric tests that we have learned to date require an interval or ratio scale of measurement for the dependent variable.

Scales of Measurement
If you are working with a dependent variable that has a nominal or ordinal scale of measurement, then you must choose a nonparametric statistic to test your hypothesis.
How Many Samples / Groups Are in the Design
Once you have identified the scale of measurement of the dependent variable, you want to determine how many samples or "groups" are in the study design. Designs for which one-sample tests (e.g., Z test; t test; Pearson and Spearman correlations; chi-square goodness-of-fit) are appropriate collect only one set or "sample" of data.
How Many Samples / Groups Are in the Design
There must be at least two sets of scores or two "samples" for any statistic that examines differences between groups (e.g., t test for dependent means; t test for independent means; one-way ANOVA; Friedman ANOVA; chi-square test of independence).
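The decision rules in this part can be collected into a small lookup sketch. This is a deliberate simplification of the deck's guidance (it ignores, for example, factorial designs and correlation questions), not an exhaustive flowchart:

```python
def choose_test(scale, n_groups, dependent=False):
    """Rough test chooser following the deck's rules.
    scale: 'nominal', 'ordinal', or 'interval/ratio'."""
    if n_groups == 1:
        return {"nominal": "chi-square goodness-of-fit",
                "ordinal": "chi-square goodness-of-fit",
                "interval/ratio": "one-sample Z or t test"}[scale]
    if scale == "nominal":
        return "McNemar test" if dependent else "chi-square test of independence"
    if scale == "ordinal":
        if n_groups == 2:
            return "Wilcoxon test" if dependent else "Mann-Whitney U test"
        return "Friedman ANOVA for ranks" if dependent else "Kruskal-Wallis H test"
    # interval/ratio scale with two or more groups
    if n_groups == 2:
        return "dependent means t test" if dependent else "independent means t test"
    return "repeated measures ANOVA" if dependent else "one-way between-groups ANOVA"
```

For instance, two independent groups measured on an ordinal scale lead to the Mann-Whitney U test, while the same design on an interval scale leads to the independent means t test.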
Parametric Tests
- Parametric statistics are used when our data are measured on interval or ratio scales of measurement
- Tend to need larger samples
- Data should fit a particular distribution; otherwise the data are transformed into that distribution
- Samples are normally drawn randomly from the population
- Follow the assumption of normality, meaning the data are normally distributed
Parametric Assumptions
Listed below are the most frequently encountered assumptions for parametric tests. Statistical procedures are available for testing these assumptions.
The Kolmogorov-Smirnov test is used to determine how likely it is that a sample came from a population that is normally distributed.
Parametric Assumptions
The Levene test is used to test the assumption of equal variances.
If we violate test assumptions, the statistic chosen cannot be applied. In this circumstance we have two options:
- We can use a data transformation
- We can choose a nonparametric statistic
If a data transformation is selected, the transformation must correct the violated assumption. If successful, the transformation is applied and the parametric statistic is used for data analysis.
Types of Parametric Tests
- Z test
- One-sample t test
- t test for dependent means
- t test for independent means
- One-way ANOVA
- Factorial ANOVA
- Pearson's r
- Bivariate/multiple regression
Non-Parametric Tests
- Inference procedures which are largely distribution-free.
- Nonparametric statistics are used when our data are measured on a nominal or ordinal scale of measurement.
- All other nonparametric statistics are appropriate when data are measured on an ordinal scale of measurement.
- An example is the sign test: a test designed to draw inferences about medians.
Types of Non-Parametric Tests
- Sign tests
- Chi-square statistics and their modifications (e.g., McNemar test) are used for nominal data.
- Wilcoxon test – alternative to the t test among the parametric tests
- Kruskal-Wallis test – alternative to ANOVA
- Friedman test – alternative to ANOVA
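The sign test listed above is simple enough to write out in full: scores above the hypothesized median count as plus signs, ties are discarded, and the count of plus signs is referred to a binomial distribution with p = 1/2. A two-sided version is sketched here, with made-up data:

```python
from math import comb

def sign_test(data, median0):
    """Two-sided sign test of H0: the population median equals median0."""
    plus = sum(1 for x in data if x > median0)
    minus = sum(1 for x in data if x < median0)  # ties (x == median0) are dropped
    n = plus + minus
    k = max(plus, minus)
    # Probability of a sign count at least this lopsided under p = 1/2
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

scores = [3, 7, 12, 15, 21, 28, 33, 41, 47, 55]   # hypothetical sample
p_balanced = sign_test(scores, 24)   # 5 above, 5 below: no evidence against H0
p_shifted = sign_test(scores, 0)     # all 10 above: strong evidence against H0
```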
Goodness of Fit Test
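The source carries only this slide title, so as a sketch: the chi-square goodness-of-fit test compares observed category counts against the counts expected under the null hypothesis, via Σ(O − E)²/E. The die-rolling counts below are made up, and 11.07 is the standard α = .05 critical value for 5 degrees of freedom:

```python
# Hypothetical data: 60 rolls of a die, testing fit to a fair (uniform) die
observed = [8, 9, 19, 6, 8, 10]
expected = [60 / 6] * 6          # 10 per face under H0: the die is fair

# Chi-square goodness-of-fit statistic: sum of (O - E)^2 / E over categories
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

CRITICAL_05_DF5 = 11.07          # chi-square critical value, df = 6 - 1 = 5
reject_h0 = chi_square > CRITICAL_05_DF5
```

Here the statistic (10.6) falls just short of the critical value, so at α = .05 these counts are still consistent with a fair die.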
Choosing the Correct Statistical Tests
Summary
Five issues must be considered when choosing statistical tests:
- Scale of measurement
- Number of samples/groups
- Nature of the relationship between groups
- Number of variables
- Assumptions of statistical tests
Introduction to Multiple and Non-Linear Regression
Hands-On Statistical Software
Thank you very much!
Hope you are now ready to conduct your study.