How to Correct the Selection Bias in Management Research
Team: Bin Xu, Oualid El Ouardi, Shabnam Kazempur
Teacher: Mr Christophe Benavent
GDO Master: Oualid El Ouardi, Xu Bin, Shabnam Kazempur
I. Overview

1.1 What is selection bias?
Selection bias is the error of distorting a statistical analysis by pre- or post-selecting the samples. Typically this causes measures of statistical significance to appear much stronger than they are, but it can also produce completely illusory artifacts. Selection bias can be the result of scientific fraud that manipulates data directly, but more often it is either unconscious or due to biases in the instruments used for observation.

1.2 Reasons for selection bias
Figure 1 shows three main reasons: selective non-response, incomplete observability, and self-selection.
1.2.1 Selective non-response
This means that too few people respond. For example, the 1997 Dutch Labour Force Survey (LFS) had a response rate of 56%. Apart from the LFS, other socio-cultural surveys in the Netherlands also show response rates between 50% and 60%. Panel studies are even more problematic: the response rate of the two-wave Dutch Parliamentary Election Study (DPES) has been below 50% since 1981, and only 43% of the electorate participated in 1998 (Aarts, Van der Kolk & Kamp, 1999, pp. 22-24). Such incomplete data are hard to draw convincing conclusions from, because the people who did not respond are probably those who disagree with our question.

1.2.2 Incomplete observability
This comes in two types: censored data and truncated data.

Censored data:
Censored data points are those whose measured properties are not known precisely, but are known to lie above or below some limiting sensitivity. For example, suppose a study is conducted to measure the impact of a drug on mortality. In such a study, it may be known that an individual's age at death is at least 75 years. Such a situation could occur if the individual disenrolled from the study at age 75, or if the individual is currently alive at the age of 75. Censoring also occurs when a value falls outside the range of a measuring instrument. For example, a bathroom scale might only measure up to 300 lbs. If a 350 lb individual is weighed using the scale, the observer would only know that the individual's weight is at least 300 lbs.

Truncated data:
Truncated data points are those which are missing from the sample altogether due to sensitivity limits. For example, if an experiment were conducted to count the distribution of sizes of fish in a lake, a net might be used to catch a representative sample of fish. If the net had a mesh size of 1 cm, then no fish narrower than 1 cm would be found in the sample. This is a result of the method of selection: there is no way of knowing whether there are any fish smaller than 1 cm based on an experiment using that net.

Censoring and truncation:
As the examples above show, censoring is when an observation is incomplete due to some random cause. The cause of the censoring must be independent of the event of interest if we are to use standard methods of analysis. Truncation is a variant of censoring which occurs when the incomplete nature of the observation is due to a systematic selection process inherent to the study design.

1.2.3 Self-selection
This term denotes any situation in which individuals select themselves into a group, causing a biased sample. It commonly describes situations where the characteristics that lead people to select themselves into the group create abnormal or undesirable conditions in the group. Self-selection is a major problem in research in sociology, psychology, economics and many other social sciences.

Self-selection makes it difficult to determine causation. For example, one might note significantly higher test scores among those who participate in
a test preparation course, and credit the course for the difference. However, due to self-selection, there are a number of differences between the people who chose to take the course and those who chose not to. Arguably, those who chose to take the course might have been more hard-working, studious, and dedicated than those who did not, and that difference in dedication may have affected the test scores of the two groups. If that is the case, then it is not meaningful to simply compare the two sets of scores: due to self-selection, factors other than the course itself affected the scores.

Self-selection causes problems for research about programs or products. In particular, it makes it difficult to evaluate programs, to determine whether a program has any effect, and to do market research.

1.3 The main selection biases in management
In most observational studies in management, selection bias is usually due to the last two reasons, because the study object is more specific and responses from it are easier to collect. In fact, the selection bias we focus on comes in two main flavors: (1) self-selection of individuals to participate in an activity or survey, or as a subject in an experimental study; (2) selection of samples or studies by researchers to support a particular hypothesis, especially by censoring data.

In the next section, the paper will describe in more detail selection bias due to self-selection and censored data in management research,
and introduce two basic theories and models used to correct selection bias: the Tobit model and the Heckman model. The former is the basis of the latter and a method for correcting selection bias arising from censored data; the latter is more widely used and is the mainstream model for correcting selection bias.

Solutions:
The study of selection bias started in 1958 with Tobin's work on household expenditures on durable and luxury goods, and since the publication of Heckman's article in 1979, which with some 7,300 citations helped earn him the Nobel Prize, there have been thousands of articles in this area. While searching through different articles we identified different categories and trends for identifying, controlling or removing selection bias.

The primary methods used were mainly mathematical and econometric models. Thanks to the volume of research and attention devoted to it, this area is almost mature, and is in fact the main source of progress on selection bias. However, there have been many efforts to simplify the mathematical models and provide insight for researchers in the human sciences who are not necessarily mathematicians. Emerging experimental and quasi-experimental methods, such as the propensity score method, are the result of these efforts.

In our paper we will first see the mathematical background. Then we will work on experimental methods. Finally we will provide our insights and solutions for dealing with selection bias. In the appendix we have provided some information about the software and toolkits which can be used for selection bias.
We have tried to avoid entering into complicated mathematical equations; we can even claim to have been more successful in this than articles with titles like "intuition about selection bias". Nonetheless, this is an area that is still very dependent on mathematics, and it is not easy to transfer the concepts without understanding their mathematical base. For this reason, in the second chapter we explain, up to a point, the mathematical bases of selection bias. Although this treatment is simple and elementary, it provides a base for those who want to follow this field through mathematical equations. In addition, it gives good insight to others who only want to understand selection bias qualitatively. Our last chapter is completely qualitative and discusses experiments and qualitative approaches.

Finally, we have provided a bibliography covering different subjects in selection bias.
Econometric Base of Selection Bias

A: Tobit Model
As we discussed before, a common occurrence in many regression models is the existence of truncation or censoring in the response variable. Tobin (1958) pioneered the study of such models in economics, analyzing household expenditures on durable goods while taking into account the fact that expenditures cannot be negative. That is, for some observations the observed response is not the actual response, but rather the censoring value (often zero), together with an indicator that censoring has occurred. More specifically, the so-called Type I Tobit model can be written as a combination of two familiar models. The first is a Probit model, which determines whether the yi variable is zero or positive; the second is a truncated regression model for the positive values of yi. The Type I Tobit model assumes that the parameters for the effect of the explanatory variables on the probability that an observation is censored and for their effect on the conditional mean of the non-censored observations are the same. In this section we give an overview of the Tobit model (which for us means the Type I Tobit model, unless otherwise specified), then an application of the model, followed by comments on the results. The last part introduces several limitations of the Tobit model.

I. Overview

1. Truncation and Censoring

a. Truncation
Truncation occurs when some observations on both the dependent variable and the regressors are lost. For example, income may be the
dependent variable and only low-income people are included in the sample. In effect, truncation occurs when the sample data are drawn from a subset of a larger population.

Examples: Let yi be the profit of the i-th firm as a percentage of assets and xi the four-firm concentration ratio of the industry the firm is in. Suppose only firms with positive profit rates are observed and firms with negative profit rates are not. In this case a = 0 and we have a problem where the dependent variable is left-truncated. In general, (yi, xi) is observed only when yi > a (left truncation), when yi < b (right truncation), or when c < yi < d (double truncation).

A second example: objects of a certain type in a specific region of the sky will not be detected by the instrument if their apparent luminosity is less than a certain lower limit. This often happens due to instrumental limitations or due to our position in the universe.

For instance, suppose that the data concern the purchases of new cars, with yi the price of the car and xi characteristics of the buyer such as age and income class. Then no observation on yi can be below the price of the cheapest new car. Some households may want to buy a new car but find it too expensive, in which case they do not purchase one and are not part of the observed data. This truncation effect should be taken into account, for instance, if one wants to predict the potential sales of a cheaper new type of car, because most potential buyers will not be part of the observed sample.

[Figure: a truncated normal density with truncation from below, at x = -1.]
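The pull that truncation exerts on the sample mean can be seen in a short simulation (our own sketch, not from the source; the truncation point -1 mirrors the figure):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Latent population: standard normal, true mean 0.
latent = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Truncation from below at -1: values under the threshold
# never enter the sample at all.
sample = latent[latent > -1.0]

# The truncated sample mean is pulled above the true mean of 0,
# and agrees with the exact truncated-normal mean from scipy.
print(round(sample.mean(), 3))
print(round(stats.truncnorm.mean(-1.0, np.inf), 3))
```

Any estimate based on such a sample overstates the population mean, which is exactly the bias the text describes.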
b. Censoring
Censoring occurs when data on the dependent variable are lost (or limited) but not data on the regressors. Sources/events can be detected, but the values (measurements) are not known completely; we only know that the value is less than (or greater than) some number. In this case all (yi, xi) are observed; it is just that when yi "passes" the censoring point, yi is recorded as that point. As in the truncation model, you can have left-censoring, right-censoring, or upper and lower censoring. For discussion purposes let us consider the case where yi ≥ 0, taking contributions to charity as an example: some people give to the designated charity and some people do not.

Examples: People of all income levels may be included in the sample, but for some reason the income of high-income people may be top-coded as, say, $100,000. Censoring is a defect in the sample: if there were no censoring, the data would be a representative sample from the population of interest. Truncation entails a greater loss of information than censoring. Long (1997, 188) provides a nice picture of truncation and censoring.

[Figure: a censored normal density with censoring from below, at x = 0.]
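The charity example above can be simulated to contrast the two mechanisms side by side (a sketch of our own, not from the source):

```python
import numpy as np

rng = np.random.default_rng(1)

# Latent charity gift each person "would" make; negative values
# mean the person does not give at all.
latent = rng.normal(loc=0.0, scale=1.0, size=50_000)

# Censoring: non-givers stay in the data, recorded as exactly 0.
censored = np.maximum(latent, 0.0)

# Truncation: non-givers vanish from the data set entirely.
truncated = latent[latent > 0.0]

# Censoring keeps every row; truncation drops about half of them.
print(len(censored), len(truncated))

# Both sample means overstate the latent mean of 0, but the
# truncated mean is roughly twice as far off.
print(round(censored.mean(), 2), round(truncated.mean(), 2))
```

This is the sense in which truncation loses more information than censoring: the censored data at least tell us who the non-givers are.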
N.B.: The main difference between censoring and truncation is that a censored object is still detectable, while in the case of truncation the object is not detectable at all.

2. Tobit model for censored data
The dependent variable is called censored when the response cannot take values below (left-censored) or above (right-censored) a certain threshold value. For instance, in the example on investments in a new financial product, the investments are either zero or positive. And, in deciding about a new car, one either pays at least the cost of the cheapest car or abstains from buying one. The so-called Tobit model relates the observed outcomes to a latent index function:

  yi* = xi'β + εi,  with yi = yi* if yi* > 0, and yi = 0 otherwise.

The Tobit model for censored data is sometimes called the Tobit type 1 model, to distinguish it from the Tobit type 2 model that will be discussed in the next section for data with selection effects. In contrast with a truncated sample, where only the responses for yi* > 0 are observed, it is now assumed that the responses yi = 0 corresponding to yi* ≤ 0 are also observed and that the values of xi for such observations are also known. In practice these zero responses are of interest, as they provide relevant information on economic behaviour. For instance, it is of interest to know which individuals decided not to invest (as other financial products could
be developed for this group) or which individuals did not buy a new car (as one could design other cars that appeal more to this group). The Tobit model can be seen as a variation of the Probit model, with one discrete option ('failure', yi = 0) and with the option 'success' replaced by the continuous variable yi > 0.

II. Application of the Tobit model for censored data
As introduced above, consider direct-marketing data concerning a new financial product: of 925 customers, 470 responded to the mailing by investing in the new product. We analyze the censored sample containing all 925 customers. We do not consider only the customers of the bank who decided to invest in the financial product, as in the truncated case; we also know the individual characteristics of the customers who decided not to invest. We will therefore construct a Tobit model for the invested amount of money. We will discuss 1) the data, 2) the ML (maximum likelihood) estimates of the Tobit model, and 3) a comparison with the results obtained if we use the truncated-sample approach rather than the censored approach.

1) The data
We consider data that were collected in a marketing campaign for a new financial product of a commercial investment firm (Robeco). The campaign consisted of a direct mailing to customers of the firm. The firm is interested in identifying characteristics that might explain which customers are interested in the new product and which ones are not. In particular, there may be differences between male and female customers and between active and inactive customers (where active means that the customer already invests in other products of the firm). Also the age of customers may be important, as relatively young and relatively old customers may have less interest in investing in this product than middle-aged people.
The data set consists of 925 individuals, of whom 470 responded by making an investment in the product and 455 did not respond. For individuals who responded, the amount of money invested in this product is known. The explanatory variables (gender, activity, age) are known for all 925 individuals, hence also for the individuals who did not invest in the product. So the dependent variable is censored, not truncated. As before, we take as dependent variable yi = log(1 + invest), where 'invest' is the amount of money invested. For individuals who did not invest (so that 'invest' is zero), we get yi = 0.

2) The ML (maximum likelihood) estimates of the Tobit model
The Tobit estimates (ML in the censored regression model) are in Panel 2 of the figures of results. For comparison this table also contains the OLS estimates obtained if the censoring is erroneously neglected (see Panel 1). The Tobit coefficients in Panel 2 are somewhat larger than the OLS (ordinary least squares) coefficients. The variables 'gender' and 'activity' have a positive effect on the amount of money invested, and age has a parabolic effect, with a maximum at an age of around 53 years (namely, where 0.196 - 2 * 0.185 * (age/100) = 0).

3) Comparison of Tobit estimates with results for the truncated sample
Comparing the results of the Tobit model in Panel 2 with the results for the truncated sample (without the Tobit model; see Panel 3), the effect of 'activity' now has the expected positive sign (instead of negative) and the maximum investments occur around an age of 53 (instead of 62). Further, the Tobit estimates indicate higher investments by males than by females, whereas the reverse effect was estimated in the truncated sample.
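The ML estimation just described can be sketched on simulated data (we do not have the Robeco data, so the data-generating numbers below are invented for illustration). The log-likelihood combines a probit term for the zeros with a normal density term for the positive amounts:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)

# Simulated "invested amount": latent y* = 1 + 2x + e, observed
# as y = max(y*, 0), mimicking customers who invest nothing.
n = 2_000
x = rng.normal(size=n)
y = np.maximum(1.0 + 2.0 * x + rng.normal(size=n), 0.0)
X = np.column_stack([np.ones(n), x])

def neg_loglik(params):
    *beta, log_sigma = params
    sigma = np.exp(log_sigma)            # keeps sigma positive
    xb = X @ beta
    ll = np.where(
        y == 0,
        norm.logcdf(-xb / sigma),                    # P(y* <= 0)
        norm.logpdf((y - xb) / sigma) - log_sigma,   # density of y > 0
    )
    return -ll.sum()

res = minimize(neg_loglik, x0=[0.0, 0.0, 0.0], method="BFGS")
beta_tobit = res.x[:2]

# Naive OLS on the censored data shrinks the slope toward zero;
# Tobit recovers approximately the true (1, 2).
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.round(beta_tobit, 1), np.round(beta_ols, 1))
```

The same contrast, Tobit versus censoring-blind OLS, is what Panels 1 and 2 of the results illustrate on the real data.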
As the information on individuals who do not invest is important for describing general investment behaviour, the results obtained for the censored sample are more reliable than those for the truncated sample. This illustrates the general point that it is always advisable to include relevant information in the model. The truncated model neglects the information on non-investing customers, and this makes it much less informative than the Tobit model for the censored data.

4) Figures of results
[Panels 1-3 of the estimation results are not reproduced in this text version.]
III. Mathematical definition of Tobit model types
Amemiya (1984) classified Tobit models into five types based on the characteristics of the likelihood function. For notational convenience, in each case the latent variable yji* is assumed normally distributed with mean xji'βj and variance σj².

Type 1 Tobit
The Type 1 Tobit model, discussed in the preceding "Censored and Truncated Regression Models" section, is defined as

  y1i* = x1i'β1 + u1i,
  y1i = y1i* if y1i* > 0, and y1i = 0 otherwise.

Type 2 Tobit
The Type 2 Tobit model is defined as

  y1i* = x1i'β1 + u1i,
  y2i* = x2i'β2 + u2i,
  y2i = y2i* if y1i* > 0, and y2i = 0 otherwise,
where (u1i, u2i) are i.i.d. bivariate normal and only the sign of y1i* is observed ("i.i.d." means independent and identically distributed).

Type 3 Tobit
The Type 3 Tobit model differs from the Type 2 Tobit in that y1i* is itself observed when

  y1i* > 0, i.e. y1i = y1i* if y1i* > 0, and y1i = 0 otherwise,
  y2i = y2i* if y1i* > 0, and y2i = 0 otherwise,

where (u1i, u2i) are i.i.d. bivariate normal.

Type 4 Tobit
The Type 4 Tobit model consists of three equations:

  y1i* = x1i'β1 + u1i,
  y2i* = x2i'β2 + u2i,
  y3i* = x3i'β3 + u3i,
  y1i = y1i* if y1i* > 0, and y1i = 0 otherwise,
  y2i = y2i* if y1i* > 0, and y2i = 0 otherwise,
  y3i = y3i* if y1i* ≤ 0, and y3i = 0 otherwise,

where (u1i, u2i, u3i) are i.i.d. trivariate normal.
Type 5 Tobit
The Type 5 Tobit model is defined like the Type 4 model, except that only the sign of y1i* is observed:

  y2i = y2i* if y1i* > 0, and y2i = 0 otherwise,
  y3i = y3i* if y1i* ≤ 0, and y3i = 0 otherwise,

where (u1i, u2i, u3i) are drawn from an i.i.d. trivariate normal distribution.

IV. Overview of trends in corrections of the Tobit model
The Tobit model was introduced by James Tobin in 1958 in order to model a specific type of discrete-continuous data commonly found in economic applications. The Tobit model is a specific case of a censored regression model and assumes that the continuous component of the data (the right tail) is normally distributed. Early examples included modeling household expenditures on luxury goods, inheritance, and expected age of retirement, but it is now used wherever selection bias is possible. However, later research has demonstrated that even small departures from the underlying normality assumption may lead to inconsistent estimators. Arabmazar and Schmidt (1982) explored the robustness of the Tobit estimator of a population mean when the assumption of normality is violated. They concluded that the bias can be quite large and that it depends on the proportion of censoring.

One technique often used to compensate for this weakness in the case of long-tailed distributions is to apply a log transformation to the data. Lorimer and Kiermeier (2007) conducted a simulation study examining the use of Tobit models on log-transformed microbiological data. They compared the Tobit method to two other methods: using only the uncensored observations, and using the limit of detection for the censored values. They concluded that the two standard methods led to biased estimates and that the Tobit model led to less
biased estimates. However, their conclusions rest on the underlying assumption of normality. Others have noted that the Tobit model often leads to nonlinear and complicated equations that are hard to solve even for mathematicians. Therefore, for each setting in which the Tobit model's solutions become nonlinear and complicated, researchers try to use simple heuristic models adapted to the case.

It is important to note that Tobit models are the basic models for identifying selection bias, and later models such as the Heckman model are special cases of the Tobit family. This is why there is still wide research in this field and many new directions have been explored; in addition, there is a strong trend toward developing classical solutions for each of these directions.

The relation between the Tobit model and the Heckman model:
The Tobit model contains as a particular case the Probit model, which is the first stage of the Heckman model; hence we can simulate the Tobit model in the same way as the first stage of the Heckman model. The Tobit model was introduced 23 years after the Probit model: the latter was introduced by Chester Ittner Bliss in 1935, and a fast method of solving it was introduced by Ronald Fisher in an appendix to the same article. Because the response is a series of binomial results, the likelihood is often assumed to follow the binomial distribution. Let Y be a binary outcome variable, and let X be a vector of regressors. The Probit model assumes that

  P(Y = 1 | X) = Φ(X'β),

where Φ is the cumulative distribution function of the standard normal distribution. The parameters β are typically estimated by maximum likelihood.

While easily motivated without it, the Probit model can be generated by a simple latent variable model. Suppose that

  Y* = X'β + ε,
where ε ~ N(0, 1), and suppose that Y is an indicator for whether the latent variable Y* is positive:

  Y = 1 if Y* > 0, and Y = 0 otherwise.

In the next part we will see in detail how this is the foundation of the Heckman model.

V. Some references
Amemiya, Takeshi (1973). "Regression analysis when the dependent variable is truncated normal". Econometrica 41 (6), 997-1016.
Amemiya, Takeshi (1984). "Tobit models: A survey". Journal of Econometrics 24 (1-2), 3-61.
Amemiya, Takeshi (1985). Advanced Econometrics. Basil Blackwell, Oxford.
Schnedler, Wendelin (2005). "Likelihood estimation for censored random vectors". Econometric Reviews 24 (2), 195-217.
Tobin, James (1958). "Estimation of relationships for limited dependent variables". Econometrica 26 (1), 24-36.
Heij, C., de Boer, P., Franses, P.H., Kloek, T., and van Dijk, H.K. (2004). Econometric Methods with Applications in Business and Economics. Oxford University Press.
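The latent-variable formulation above can be checked with a quick simulation (our own sketch, with arbitrary coefficients): the share of Y = 1 among observations with X near a fixed value matches Φ(X'β).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Latent-variable generation of a probit outcome:
# Y* = b0 + b1*x + eps, eps ~ N(0, 1), and Y = 1 when Y* > 0.
b0, b1 = 0.5, 1.0
n = 200_000
x = rng.normal(size=n)
y = (b0 + b1 * x + rng.normal(size=n)) > 0

# For observations with x near 0, the share of Y = 1 should be
# close to Phi(b0), as the probit formula predicts.
near_zero = np.abs(x) < 0.05
print(round(y[near_zero].mean(), 2), round(norm.cdf(b0), 2))
```

The same mechanism, thresholding a normal latent variable, is what the Heckman model uses as its selection stage.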
Heckman Model:
We saw how there may be self-selection by the individuals or data units being investigated. Here we review one example and try to model it with Heckman's original model, which is an econometric model.

Suppose we observe that college grades are uncorrelated with success in graduate school. Can we infer that college grades are irrelevant? Of course not: unmeasured variables (e.g. motivation) used in the admissions process might explain why those who enter graduate school with low grades do as well as those who enter with high grades.

Formulating the problem:

  Y1i = X1i'β1 + U1i   (outcome equation: graduate results),
  Y2i* = X2i'β2 + U2i  (selection equation: admission),
  with Y1i observed only when Y2i* > 0.

Solving the problem:
If we continue and solve this model, for the selected sample we will have:

  E[Y1i | X1i, Y2i* > 0] = X1i'β1 + ρσ1 λ(X2i'β2),

where λ(·) = φ(·)/Φ(·) is the inverse Mills ratio.
Here, λi will be our new regressor and V1i will be the new specific error; in our example it captures motivation.

Y1 = the results of students at university
X1 = undergraduate scores
Y2 = admission to graduate school
X2 = all the factors that determine the admission result
i = index of observations

To be present in the regression sample, one must be admitted, i.e. Y2 > 0.

References:
Heckman, James J. (1979). "Sample Selection Bias as a Specification Error." Econometrica 47 (1), 153-161.
Greene, W.H. Econometric Analysis, third edition.

Graphical explanation of the Heckman model:
As we discussed, selection bias arises because we observe only the samples that satisfy a preliminary condition. We can show this graphically: we can only see the parts in which r is below 0, while the values of r that are not represented in our survey can show quite different characteristics.
The example we study here concerns the application of the Heckman model to insurance. We apply the same probabilistic conditioning, but since we have a binary situation we do not need regressions and can treat the problem with probability theory.

Auditing policies derived from statistical analysis and applied to insurance claims face a major selection bias problem. Most insurance companies are reluctant to carry out a random auditing policy, because the long-term influence of an audit decision on the policyholder's value for the company is negative. Indeed, an honest policyholder may take the audit process amiss, and his loyalty to the company, as well as his value for the insurer, may decrease as a consequence. Hence companies are deterred from performing a systematic audit on part of their claims database; in fact they audit only suspicious files. In this way we have a selection bias: we select those who are suspicious. In order to
identify fraudulent insurance claims at an early stage, most systems score a new incoming claim using a set of fraud indicators. If the score is high enough, then the claim is audited and fraud (or abuse) may be confirmed. Only claims with a high suspicion level (or score) are selected for investigation.

Let us now formalize the selection bias issue. If all incoming claims can be considered for audit, it can be formalized in the following way: A and F denote the binary variables related to audit and fraud, and x is the vector of variables describing the claim. A statistical model assessing fraud risk is derived from the audited claims and estimates probabilities of the type P(F = 1 | A = 1, x) = E(F | A = 1, x). An audit policy induced by this model is then applied to the incoming claims, but what it needs are the probabilities P(F = 1 | x) = E(F | x). Selection bias is a consequence of the confusion between these conditional and unconditional probabilities.

Here S represents the suspicion score of the files. In normal auditing, the number of audited files (A) is usually smaller than the number of suspected files, because we do not audit every file and only evaluate suspected files where S > 0.

Random auditing of claims is the basic strategy that makes it possible to counteract selection bias. A pure random auditing strategy consists in picking claims at random and then auditing them. This controlled experiment eliminates the selection induced by the audit decision. The estimation of a single fraud equation in this sample provides an estimated fraud probability for incoming claims which is not subject to selection
bias. Here we calculate the parameters for both random and the usual non-random samples; one can then compare the difference between the two cases.

Application of the Heckman Model:
Since its introduction, the Heckman model has been widely diffused in statistical testing. While searching for an example of the Heckman model, we observed that it has become a common and even necessary test in applied economics. In other words, doing a Heckman test is no longer an advantage for an article; not doing it counts as a weakness of the research. Most research uses the Heckman test as a criterion for showing the validity of the results. Sometimes it is used alongside other methods to show the possible variance of the answers, as in the example below, which (translated from French) aims to measure the productivity components of the banking sector:

  ln Y = B0 + B1(M-O) + B2(M-O)² + B3 Sprfc + B4 Année + BRI RI + e,

where Y is the output measure, M-O is the labour measure, Sprfc that of capital, Année is a binary variable distinguishing 1995 from
1996, and RI is the binary indicator for the presence of incentive pay (intéressements).

In this article the authors go on to calculate the coefficients with the Heckman method and with the instrumental-variables method to remove selection bias, as below:
As you can see, the results of the three methods are not the same, and the authors continue their reasoning to identify the correct range of answers.¹ Note that the authors do not necessarily regard Heckman's method as the final methodology; it is only a means of approaching the problem from another dimension. In fact the real problems

¹ Simon Drolet, Paul Lanoie, Bruce Shearer, Analyse de l'impact productif des pratiques de rémunération incitative pour une entreprise de services : Application à une coopérative financière québécoise, 1999, CIRANO.
are often more complex than what we saw at the beginning with Heckman's original article. There are fields in which Heckman's model has proved complicated and of little use; in other fields, different lines of research have established the basis for using Heckman's model as a classical, reliable solution to the specific kinds of problems that arise there. As we will see in the next chapter, other research has tried to exploit quantitative work to provide practical hints for researchers using qualitative methods.

Experimental Methods:
We have seen the Tobit and Heckman methods, which are based on a purely mathematical approach. However, after the primary articles, selection bias came to the attention of different fields of social science. At the beginning this work was restricted to econometricians, who were trying to provide insight for other disciplines; later, other branches of the human sciences, though they did not have the econometricians' excellence in
dealing with mathematical problems, came to work on it and facilitated the introduction of experimental methods.

As we saw in previous sections, the standard econometric method for evaluating social programs uses the outcomes of participants to estimate what nonparticipants would have experienced had they participated. The difference between participant and nonparticipant outcomes is the estimated gross impact of a program reported in many evaluations. The outcomes of nonparticipants may differ systematically from what the outcomes of participants would have been without the program, producing selection bias in estimated impacts. A variety of non-experimental estimators adjust for this selection bias under different assumptions. Under certain conditions, randomized social experiments eliminate this bias. (Heckman, 1997)

Here, as one of the most important experimental models, we study the propensity score model and continue our discussion.
Propensity score matching

Introduction:

The probability of selection into a treatment, also called the propensity score, plays a central role in classical selection models and in matching models (see, e.g., Heckman, 1980; Heckman and Navarro, 2004; Heckman and Vytlacil, 2007; Hirano et al., 2003; Rosenbaum and Rubin, 1983). Heckman and Robb (1986, reprinted 2000), Heckman and Navarro (2004) and Heckman and Vytlacil (2007) show how the propensity score is used differently in matching and selection models. They also show that, given the propensity score, both matching and selection models are robust to choice-based sampling, which occurs when treatment group members are over- or under-represented relative to their frequency in the population. Choice-based sampling designs are frequently chosen in evaluation studies to reduce the costs of data collection and to obtain more observations on treated individuals. Given a consistent estimate of the propensity score, matching and classical selection methods are robust to choice-based sampling, because both are defined conditional on treatment and comparison group status.

Hence, in statistics, propensity score matching (PSM) is one of the quasi-empirical "correction strategies" that correct for selection bias in making estimates.

Generally, PSM is used for causal inference and simple selection bias in non-experimental settings in which: (i) few units in the non-experimental comparison group are comparable to the treatment units; and (ii) selecting a subset of comparison units similar to the treatment
unit is difficult, because units must be compared across a high-dimensional set of pretreatment characteristics.

In other words, the propensity score method is usually applied in two situations.

Selection bias: potential bias from treatment assignment/selection conditional on observed variables, due to the effects of unobserved variables correlated with selection into treatment.

Finite data: a limited sample size reduces our ability to estimate causal effects by conditioning on observed variables.

Note, however, that PSM only adjusts for (but does not fully solve the problem of) selection bias, and only minimizes the limitations of matching on many observed variables with finite data.

In this section, we introduce an empirical case to show how PSM works.

Web surveys are the most economical way to conduct social and market surveys, but selection bias can invalidate the results obtained. The main source of bias derives from internet access coverage: if even in the USA web surveys of the elderly population (50 years old or more) run strong risks of biased results (Couper et al., 2007), the greater digital divide in Italy, where 89.2% of the population over 50 does not have Internet access (Istat, 2005), suggests caution in using web surveys in that country. Taking into account that in many social or market surveys the target population may not coincide with the general population, it can be interesting to apply methods to correct web survey results when surveys are conducted on topics related to niche interests.
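The matching half of PSM can be sketched in a few lines. This is our own minimal illustration, not the estimator used in the web-survey study below: the unit names, propensity scores and outcomes are invented, and in a real study the scores would come from a logit or probit of treatment status on the observed covariates.

```python
# One-to-one nearest-neighbour matching on the propensity score (sketch).
# Scores and outcomes below are hypothetical illustration values.

def match_nearest(treated, controls):
    """Match each treated unit to the control with the closest score."""
    matches = []
    for t_id, t_score in treated:
        c_id, _ = min(controls, key=lambda c: abs(c[1] - t_score))
        matches.append((t_id, c_id))
    return matches

def att(matches, outcomes):
    """Average treatment effect on the treated, from matched pairs."""
    diffs = [outcomes[t] - outcomes[c] for t, c in matches]
    return sum(diffs) / len(diffs)

treated = [("t1", 0.80), ("t2", 0.55)]
controls = [("c1", 0.78), ("c2", 0.50), ("c3", 0.20)]
outcomes = {"t1": 10.0, "t2": 8.0, "c1": 7.0, "c2": 6.5, "c3": 4.0}

pairs = match_nearest(treated, controls)   # t1 pairs with c1, t2 with c2
effect = att(pairs, outcomes)              # mean of (10-7) and (8-6.5)
```

Note that matching each treated unit to the control with the nearest score is only one of several schemes (caliper, kernel and stratified matching are common alternatives), and it reuses controls with replacement here.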
The first step is to define the target population well, on the basis of the topic under investigation. Only a subset (of unknown size) of this population can have web access; but, using a national survey, it is possible to quantify and qualify the subset of the target population with no web access. These data can be used to correct the web survey results.

The second step is to obtain an email list to which the web survey can be submitted. Following a proposed classification (Romano et al., 2006), email lists can differ in accuracy and in the reasons for enrollment. There are many thematic web sites where, to be enrolled, people go through authentication (login and password). These email lists are surely an interesting starting point for web surveys: interest in the topic itself becomes a favorable element for a good survey result in terms of response rate (among others, Olson, 2006).

The third step is to apply methods that correct, ex post, the selection bias deriving from web access. Among other methods proposed in the literature, the propensity score technique, originally proposed to correct selection bias in observational health studies (Rosenbaum et al., 1983), has recently been used to correct selection bias deriving from nonprobabilistic sampling (Terhanian et al., 2001) and/or web surveys (Schonlau et al., 2004).
We discuss preliminary results obtained by applying this technique to data collected through a web survey of people enrolled on a well-known enogastronomic web site.

Apart from the survey's goals, in this work we take into account a target population defined as Italian people from 18 to 78 years old, and focus on estimating the proportion of people who went on holiday. For the application of the propensity score method, the data used consist of a subset of web survey respondents (nw = 4,128) and a subset of the Istat Multiscopo survey (nm = 37,677). A logistic regression was performed using the survey indicator as the dependent variable (0 = Multiscopo; 1 = Web survey) and gender (2 levels), level of education (3 levels), geographical area (5 levels) and age (6 levels) as regressors. All variables were significant at α < 0.04; respondents from both surveys were divided into 5 and then into 10 bins according to their propensity scores. The propensity weights were applied to the web survey results, as shown in Table 1.

The results are quite impressive considering the digital divide of the Italian population (only 31.8% has internet access): the difference between weighted and unweighted web results is almost 20%. Taking into account that this is an intentionally simple exercise, we want to highlight that more useful results can be obtained when the target population is tailored to specific survey goals and/or when the digital divide is not so dramatically high.
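The weighting step described above (bin respondents by propensity score, then reweight the web sample toward the reference sample's bin distribution) can be sketched as follows. All counts and answers below are invented for illustration; they are not the Istat or web-survey figures from the study.

```python
# Propensity-score bin weighting (sketch with hypothetical numbers).

def propensity_weights(web_counts, ref_counts):
    """Weight for each bin: reference-sample share / web-sample share."""
    n_web = sum(web_counts.values())
    n_ref = sum(ref_counts.values())
    return {k: (ref_counts[k] / n_ref) / (web_counts[k] / n_web)
            for k in web_counts}

def weighted_proportion(respondents, weights):
    """Weighted share of positive answers; respondents are (bin, 0/1) pairs."""
    num = sum(weights[b] * y for b, y in respondents)
    den = sum(weights[b] for b, _ in respondents)
    return num / den

web_counts = {1: 10, 2: 30}   # the web sample over-represents bin 2
ref_counts = {1: 50, 2: 50}   # the reference sample is balanced
w = propensity_weights(web_counts, ref_counts)

# Four web respondents: (propensity bin, went on holiday yes=1/no=0)
respondents = [(1, 1), (1, 0), (2, 1), (2, 1)]
p = weighted_proportion(respondents, w)
```

The unweighted proportion here would be 3/4; after reweighting, answers from the over-represented bin count for less, which is exactly the direction of the almost-20% correction reported above.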
Certainly, the application above is just a simple case; here we only outline the logic, to show how the propensity score method corrects selection bias.

We should also be clear that PSM has several limitations: large samples are required; group overlap must be substantial; and hidden bias may remain, because matching only controls for observed variables, and only to the extent that they are perfectly measured (Shadish, Cook, & Campbell, 2002).

Analysis:

We saw the propensity score matching model as a method which, by implementing a controlled experiment, identifies the mechanism of selection in our estimation and gives us direction to adjust our estimates. However, social experiments are costly, and the identifying assumptions required to justify them are not always satisfied. Nonetheless, it is widely held that there is no valid alternative to experimentation as a method for evaluating social programs (see, e.g., Burtless, 1995). There are other methods which combine experimental and mathematical models to identify and remove selection bias. In an important paper, LaLonde (1986) combines data from a social experiment with data from nonexperimental comparison groups to evaluate the performance of many commonly used nonexperimental estimators. For the particular group of parametric estimators that he investigates, and for his particular choices of regressors, he concludes that the estimators chosen by econometric model selection criteria produce a range of impact estimates that is unacceptably large.2 Stolzenberg and Relles, in their article, try to extract intuitive insights for non-

2 Characterizing Selection Bias Using Experimental Data, James Heckman, Hidehiko Ichimura, Jeffrey Smith, August 1997
econometricians and qualitative researchers. They mention that they provide mathematical tools to assist intuition about selection bias in concrete empirical analyses. These new tools do not offer a general solution to the selection bias problem,3 and we can claim that our work in this project is much more intuitive for non-econometricians than theirs. However, their statements gave us direction to correct our approach, and judging by the frequent citations, we understand that they originated a large wave of work on the limitations of Heckman models and their solutions. They mention that the Heckman model does not necessarily improve the solution, and that it can only be used together with other examples to discuss the range of validity of problems, something we discussed before. Besides, they mention that the Heckman model and other mathematical models usually do not provide enough intuition and direction for correcting answers, and they insist on implementing different experiments and exploiting the researcher's own insight for detecting and adjusting selection bias problems. Heckman himself, in his later articles, works on papers which identify selection bias by social experiments and then measure the accuracy of the experiment with econometric models.4

Conclusion:

In this paper, we introduced and identified important types of selection bias. As the most important models, we studied the Tobit and probit models, and identified the Heckman model as a specific case of the Tobit model. We gave an overview of different types of Tobit model, with their possible solutions and conditions. In particular, we studied the Tobit 1 model in detail. We

3 Stolzenberg R. M. and Relles D. A., Tools for intuition about sample selection bias and its correction, American Sociological Review, Vol. 62, No. 3, 1997
4 Characterizing Selection Bias Using Experimental Data, James Heckman, Hidehiko Ichimura, Jeffrey Smith, August 1997
acknowledged that the Tobit model has conditions which cannot be easily satisfied, and nonlinear equations which are not always easy to solve. Then we studied the Heckman model as a special case of selection bias. Finally, we argued that there is no single confident and comprehensive answer to selection bias: given the complexity of the relations, different methods should be used, or avoided. Some papers argue that in some circumstances the Heckman model should be avoided; conversely, in some areas the efficiency of the Heckman model is so well established that researchers have developed classical solutions for certain kinds of problems. Besides, today selection bias is so pervasive in econometrics that it has become a necessary part of research. However, there is a growing trend of using experimental methods to identify and correct selection bias. In this respect we studied propensity score matching as a way of identifying selection bias by experiment, with control groups, and of adjusting our data selection and treatment to remove selection bias.

In addition, we saw how researchers approach the question with different experimental and econometric methods, especially using conditional probabilities. They calculate the related parameters with different methods and discuss the different responses and their credibility. Once a generalization about some special case is obtained, researchers identify it as a classical solution and in future use it as a standard test. Likewise, for each condition of the probit models, whose solutions often become nonlinear and complicated, they try to use simple and heuristic models for each case. According to what we have learned through our studies in this project, we can say that if you are a
researcher in management and want to evaluate selection bias, our first advice is to start with a mathematician, because the tools are still not independent of mathematics, and econometrics is the main knowledge-creating body. The proof of our claim is the high level of mathematics in papers that intend to give qualitative insights. That is why we saw workshops and courses in US and UK universities, like Oxford, specially developed to teach selection bias to students of the human sciences. However, controlled experiments can be very useful for identifying and adjusting selection bias for qualitative researchers. Besides, one short-cut can be to find an article, either in econometrics or in your related discipline, that has already worked on your subject. In other words, look for classical solutions to your problems.
APPENDIX:

Software and Toolkits: A Short Toolkit for Tobit modeling with EasyReg

1) Introduction

Using the Tobit model with a purely mathematical approach is very difficult: there are many types of Tobit model and several hypotheses to consider in order to get the final result. In this section we give a short summary of using the EasyReg software to fit a Tobit model starting from your gathered empirical data.

2) The data

The data have been generated artificially as follows. The independent variables X1,j and X2,j and the error Uj for j = 1,...,n = 500 have been
drawn independently from the standard normal distribution, and Y has been generated as:

Y = max(0, X1,j + X2,j + Uj).

Thus, if an intercept is included in the model, so that the vector of regressors is Xj = (X1,j, X2,j, 1), then the true parameter vector is b = (b1, b2, b3), where b1 = 1, b2 = 1, b3 = 0. Moreover, the true value of the error standard deviation s is s = 1.

The data file involved is Tobit_Data.TXT, which starts like this:

Observation          Y            X1            X2     Z
          1  0.000000000  -1.463631868  -0.640421391   0
          2  0.000000000   0.427667916  -0.219542548   0

and is in the former EasyReg default format. This data file also contains a variable Z, which I will use and explain later. (See the guided tour on importing data files in EasyReg space-delimited text format in the EasyReg software book.)

3) How to estimate a Tobit model with EasyReg

Now open "Menu > Single equation models > Tobit models" in the EasyReg main window, select the variables Y, X1 and X2, and keep the default intercept, similar to running an OLS regression with an intercept, until you arrive at the following window.
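The data-generating process of section 2 can also be reproduced outside EasyReg. Here is a sketch in Python (our own illustration: the seed and exact draws will not match Tobit_Data.TXT, but the design, n = 500 with Y = max(0, X1 + X2 + U) and standard normal draws, is the one described above):

```python
import random

def generate_tobit_data(n=500, seed=0):
    """Draw (Y, X1, X2) rows with Y = max(0, X1 + X2 + U), all draws N(0, 1)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        x2 = rng.gauss(0.0, 1.0)
        u = rng.gauss(0.0, 1.0)
        rows.append((max(0.0, x1 + x2 + u), x1, x2))
    return rows

data = generate_tobit_data()
zeros = sum(1 for y, _, _ in data if y == 0.0)
# Since X1 + X2 + U ~ N(0, 3), roughly half of the observations are
# censored at zero, matching the 48.80% frequency in the output below.
```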
In general there is no need to adjust the stopping rules of the Newton iteration used to maximize the likelihood function. Thus, click "Tobit analysis". After a few seconds the maximum likelihood estimation results appear:

If you click "Continue", the module NEXTMENU will be activated:
You have seen this window before after running an OLS regression, so no further explanation is necessary.

The output is listed below. Note that I have used the option "Wald test of linear parameter restrictions" to test the joint null hypothesis b1 = 1, b2 = 1, b3 = 0. This hypothesis is not rejected, of course, at any reasonable significance level.

4) The output

Tobit model:
y = y* if y* > 0, y = 0 if y* <= 0, where y* = bx + u
with x the vector of regressors, b the parameter vector,
and u a N(0,s^2) distributed error term.
Dependent variable:
Y = Y
Characteristics:
Y
First observation = 1
Last observation = 500
Number of usable observations: 500
Minimum value: 0.0000000E+000
Maximum value: 5.4575438E+000
Sample mean: 7.2127526E-001
This variable is nonnegative, with 244 zero values.
A Tobit model is therefore suitable.
X variables:
X(1) = X1
X(2) = X2
X(3) = 1
Frequency of Y = 0: 48.80% (244 out of 500)
Newton iteration successfully completed after 5 iterations
Last absolute parameter change = 0.0001
Last percentage change of the likelihood = 0.0603
Tobit model: Y = max(Y*,0), with
Y* = b(1)X(1) + b(2)X(2) + b(3)X(3) + u,
where u is distributed N(0,s^2), conditional on the X variables.
Maximum likelihood estimation results:
Variable             ML estimates (t-value) [p-value]
x(1)=X1              b(1)=  1.0547731 (17.0084) [0.00000]
x(2)=X2              b(2)=  0.9905518 (15.2253) [0.00000]
x(3)=1               b(3)= -0.0243418 (-0.3450) [0.73011]
standard error of u  s=     1.0635295 (21.9209) [0.00000]
[The p-values are two-sided and based on the normal approximation]
Log likelihood: -4.74065017126E+002
Pseudo R^2: 0.60984
Sample size (n): 500
Information criteria:
Akaike:       1.912260069
Hannan-Quinn: 1.925490511
Schwarz:      1.945976933
If the model is correctly specified then the maximum likelihood parameter estimators b(1),..,b(3), minus their true values, times the square root of the sample size n, are (asymptotically) jointly normally distributed with zero mean vector and variance matrix:
 1.92290870E+00  6.77554263E-01 -9.38221607E-01
 5.37455447E-01  2.11638376E+00 -9.79444588E-01
-9.81136382E-01 -1.09217153E+00  2.48931672E+00
Wald test:
x(1)=X1 b(1)=  1.0547731 (17.0084)(*)
x(2)=X2 b(2)=  0.9905518 (15.2253)(*)
x(3)=1  b(3)= -0.0243418 (-0.3450)(*)
(*): Parameters to be tested
Null hypothesis:
1.x(1)+0.x(2)+0.x(3) = 1.
0.x(1)+1.x(2)+0.x(3) = 1.
0.x(1)+0.x(2)+1.x(3) = 0.
Null hypothesis in matrix form: Rb = c, where
R =
1. 0. 0.
0. 1. 0.
0. 0. 1.
and c =
1.
1.
0.
Wald test statistic: 0.98
Asymptotic null distribution: Chi-square(3)
p-value = 0.80630
Significance levels: 10%   5%
Critical values:     6.25  7.81
Conclusions:        accept accept
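The estimates above maximize the Tobit log-likelihood, which combines a normal-density term for the uncensored observations with a normal-CDF term for the censored ones. Here is a sketch of that function in Python; it is our own illustration of the standard Tobit 1 likelihood, not EasyReg's code, and the simulated data will not reproduce the exact numbers above.

```python
import math
import random

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tobit_loglik(b1, b2, b3, s, rows):
    """Log-likelihood of Y = max(0, b1*X1 + b2*X2 + b3 + U), U ~ N(0, s^2)."""
    ll = 0.0
    for y, x1, x2 in rows:
        xb = b1 * x1 + b2 * x2 + b3
        if y <= 0.0:
            ll += math.log(norm_cdf(-xb / s))  # censored: P[Y* <= 0 | X]
        else:
            z = (y - xb) / s                   # uncensored: normal density
            ll += -0.5 * z * z - 0.5 * math.log(2.0 * math.pi) - math.log(s)
    return ll

# Simulate from the true model, then check that the true parameter vector
# (1, 1, 0, 1) fits much better than a clearly wrong one.
rng = random.Random(1)
rows = []
for _ in range(500):
    x1, x2, u = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
    rows.append((max(0.0, x1 + x2 + u), x1, x2))

ll_true = tobit_loglik(1.0, 1.0, 0.0, 1.0, rows)
ll_wrong = tobit_loglik(0.0, 0.0, 0.0, 1.0, rows)
```

EasyReg's Newton iteration maximizes exactly this kind of objective over (b1, b2, b3, s); a hand-rolled version would pass `tobit_loglik` to a numerical optimizer instead of comparing two fixed parameter vectors.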
5) An inappropriate attempt to conduct Tobit analysis

As an example of a case for which EasyReg refuses to conduct Tobit analysis, select the variables Z, X1, X2 and the constant 1 for the intercept, and declare Z the dependent variable. Then you will get stuck here:

The problem is that Z is discrete, because I have generated it as Z = Int(100*Y), where the "Int" function truncates its argument to an integer, by cutting off all the digits after the decimal symbol (a dot "." in the US, a comma "," in Europe). But the Tobit model assumes that Z has a continuous distribution, conditional on Z > 0 and X1 and X2, so the assumptions of the Tobit model do not hold. Therefore, in order to prevent you from doing bad econometrics, EasyReg will not allow you to continue.

In view of the queries I have received about this issue, the message in this window may not be clear enough. If so, click the "Yes" button, which opens a PDF file:
6) What to do if the dependent variable Y is confined to a bounded interval?

a. The case Y ∈ (a,b]

If the observed dependent variable Y is confined to an interval (a,b], where -∞ < a < b < ∞ with P[Y = b] > 0, it is possible to transform Y into a new dependent variable Z, say, such that Z ∈ [0,∞) and P[Z = 0] = P[Y = b] > 0, namely Z = -ln[(Y - a)/(b - a)]. Next, assume that Z = max(0,Z*), where Z* = β′X + U. Then

Y = min(b, a + (b - a)exp(-Z*)) = min(b, a + (b - a)exp(-β′X - U)).

To create this variable Z, open Menu > Input > Transform variables, and conduct the following transformations:

1. Click the "Constant = 1" button. Then a new variable "1" is created, which has the value 1 for all observations.
2. Click the "Linear combination of variables" button, select "1" and use the value of a as coefficient. Then a new variable with name "ax1" is created, which has the value a for all observations. I will assume that you have renamed the variable "ax1" as variable A.
3. Click the "Linear combination of variables" button, select "1" and use the value of b as coefficient. Then a new variable with name
"bx1" is created, which has the value b for all observations. I will assume that you have renamed the variable "bx1" as variable B.
4. Click the "Linear combination of variables" button, select the variables Y and A, and create the linear combination Y-A. I will assume that you have renamed Y-A as YminA. Note that now YminA ∈ (0, b-a].
5. Click the "Linear combination of variables" button, select the variables B and A, and create the linear combination B-A. I will assume that you have renamed B-A as BminA. Note that BminA is a constant with value b-a for all observations.
6. Click the "Multiplicative transformation of variables" button, select the variables YminA and BminA and use the powers 1 and -1, respectively, to create the new variable "YminA x BminA^-1". I will assume that you have renamed this new variable as YminA/BminA. Note that YminA/BminA ∈ (0,1].
7. Click the "LOG transformation: x -> ln(x)" button, and select the variable YminA/BminA. Then the new variable LN[YminA/BminA] will be created. Note that LN[YminA/BminA] ∈ (-∞,0].
8. Click the "Linear combination of variables" button, select the variable LN[YminA/BminA], and use the coefficient -1 to create the variable -LN[YminA/BminA]. I will assume that you have renamed this variable as Z. Thus, Z = -LN[YminA/BminA]. Now Z ∈ [0,∞), and P[Z = 0] = P[Y = b] > 0.

The new variable Z in step 8 can now be used as the dependent variable in a Tobit model. However, keep in mind that in this case a negative coefficient of an X variable implies a positive effect on the original dependent variable Y, because ∂Z/∂Y = -1/(Y - a) < 0. Although needless to say (but I will say it anyhow), if a = 0 and b = 1 then you can skip steps 1 to 6, and use Y instead of YminA/BminA in step 7.

b. The case Y ∈ [a,b)

If Y ∈ [a,b), where -∞ < a < b < ∞ with P[Y = a] > 0, then Z = -ln[(b - Y)/(b - a)] ∈ [0,∞) with P[Z = 0] = P[Y = a] > 0.
This variable Z can be created similarly to steps 1 to 8 above, and can be used as the new dependent variable in a Tobit model. Since now ∂Z/∂Y > 0, a positive coefficient of an X variable implies a positive effect of this X variable on Y.
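The eight transformation steps for case (a) collapse into one formula, which can be checked numerically. This is a sketch with illustrative values of a and b, done directly in Python rather than through the EasyReg menus:

```python
import math

def to_z(y, a, b):
    """Z = -ln[(Y - a)/(b - a)], mapping (a, b] onto [0, inf)."""
    return -math.log((y - a) / (b - a))

def from_z(z, a, b):
    """Inverse map: Y = a + (b - a) * exp(-Z)."""
    return a + (b - a) * math.exp(-z)

a, b = 2.0, 5.0               # hypothetical interval bounds
z_at_b = to_z(b, a, b)        # the mass point Y = b maps to Z = 0
y = 2.75
z = to_z(y, a, b)             # interior points map to Z > 0
y_back = from_z(z, a, b)      # the round trip recovers Y
```

The inverse map `from_z` is exactly the Y = a + (b - a)exp(-Z*) relation from the text, which is why a Tobit fit on Z can be translated back into statements about Y.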
Note that now we model the conditional distribution of Y by

Y = max(a, b - (b - a)exp(-Z*)) = max(a, b - (b - a)exp(-β′X - U)).

c. The case Y ∈ [a,b]

This case cannot be handled by standard Tobit analysis.

You can find some specialized software via the links below.

LIMDEP offers the widest variety of sample selection models. Information regarding LIMDEP can be found at www.limdep.com. Also, a student version, along with documentation, can be downloaded free from ww.stern.nyu.edu/~wgreene/Text/econometricanalysis.htm. The student version of this software and accompanying data sets are included with Greene's (2000) text.

For SAS users, Jaeger (1993) provides the code for performing Heckman's two-step estimation of sample selection bias. This program can be downloaded from the SAS Institute web page using the following link (http://ftp.sas.com/techsup/download/stat/heckman.html). Some adjustments to the code are necessary for the program to work (i.e., your own variable names must be inserted).

Finally, Stata 7 (2001) (http://www.stata.com/site.html) can also be used to estimate Heckman's (1976, 1979) two-step detection and correction of sample selection bias. Specific programming information can be found at the following Stata link
(http://www.stata.com/help.cgi?heckman).

In most software packages, such as Stata, SAS, and SPSS, there are special tests for these problems. If you search the software's help for Heckman, Tobit, selection bias, and, more importantly, endogeneity, you will find the related syntax, which will simplify your work significantly. For Stata we observed many times that the problem is well covered there and could usually be identified and addressed. Here you can see the result of searching Stata's help:

-------------------------------------------------------------------------------
search for selection bias                              (manual: [R] search)
-------------------------------------------------------------------------------

Keywords: selection bias
Search:   (1) Official help files, FAQs, Examples, SJs, and STBs
          (2) Web resources from Stata and from other users

Search of official help files, FAQs, Examples, SJs, and STBs

[R] heckman . . . . . . . . . . . . . . . . . . . Heckman selection model
(help heckman)

[SVY] svy: heckman . . . . . . . . Heckman selection model for survey data
(help svy: heckman)

FAQ . . . . . . . . . . . . . . . Endogeneity versus sample selection bias
. . . . . . . . . . . . . . . . . . . . . . . . . . . . D. Millimet 10/01
What is the difference between endogeneity and sample selection bias?
http://www.stata.com/support/faqs/stat/bias.html

FAQ . . . . . . . . . . . . . . Determining the sample for a Heckman model
. . . . . . . . . . . . . . . . . . . . . . . V. Wiggins and W. Gould

Besides, in Stata, if you search with a well-chosen phrase you may be directed on the web to other users who have worked on your subject. You can also contact the center that develops the software and ask your specific questions.