Labeling Data



How to Use STATA for Windows

Wei Sun

Introduction

This class is designed to acquaint you with the basic use of STATA for Windows. We will learn the specifics of two types of statistical tests in this class. Familiarity with IBM PCs and Windows and a basic knowledge of statistics are required, but it is assumed that you have no prior experience with other versions of STATA.

What is STATA?

STATA is a statistical software package for managing, analyzing, and graphing data. STATA for Windows, Release 5.0 is the most recent Windows-based version of STATA; versions are also available for Macintosh, DOS, and UNIX. STATA for Windows is a very powerful program that can perform a wide variety of statistical techniques, such as the t-test, analysis of variance, factor analysis, and multivariate regression.

Overview

In the course of this class, we will use STATA to do the following tasks:

Exploring the Data
The data is loaded into memory and viewed using the Data Editor. Descriptive statistics are obtained in a table by typing simple commands.

Conducting a T-Test
A t-test is run in STATA on two of the variables to test whether the two samples came from populations with the same mean.

Running a Regression
A simple linear regression is run in STATA to test whether a linear relationship exists between two of the variables. Generating plots using STATA will also be introduced.

Selecting Data
Some of the data is removed from consideration, and another linear regression is run.

Why Use STATA?

The calculations required for even basic statistics can be very cumbersome and tedious to do manually or even with a programmable calculator. Many advanced statistical techniques are practically impossible to do without using a computer. A statistical package has a similar effect on data analysis as a word-processing package has on writing. Statistical packages can sometimes help you work more quickly, but the primary effect is that the quality of your work can be remarkably improved. Using a statistical package, you will get much more information out of your data, and you will be able to do more accurate and efficient analyses than if you were doing the analyses manually.

STATA is a general-purpose statistical package for researchers of all disciplines. It is available for Windows, Macintosh, DOS, and UNIX, which makes STATA widely accessible. Using only a handful of basic commands, a researcher can do data management and programming in STATA.

Getting Started with STATA

Follow this path to access the Stata program: Programs/Basic/Statistics and Math/Stata 5.0.

Getting Familiar with STATA for Windows

STATA has four windows: Stata Command, Review, Variables, and Stata Results. The Stata Command window is where you type commands. Past commands appear in the Review window, and the variable list appears in the Variables window. Output appears in the Stata Results window. You can open the editor by clicking it on the title bar. The editor is like a spreadsheet, allowing you to do data entry.

Finding Help

STATA for Windows has an extensive help system that can help you quickly find information about most STATA commands. The help system can be accessed by choosing Help from the main menu bar or by entering commands in the Command window.
In the help system, you can choose between searching by topic with lookup topic and getting help for a specific Stata command. Help files for Stata commands contain the command's syntax, a description of the command, its options, examples, and references.
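The two ways of asking for help from the Command window can be sketched as follows. This is a minimal illustration, not output from the class: the topic word "regression" is an arbitrary example, and the lookup form is the one this handout gives for Release 5.0 (later Stata releases use search for keyword searches instead).

```stata
* Open the help file for one specific command
help ttest

* Search the help topics by keyword, in the form this handout describes
* (assumption: Release 5.0; in later releases try: search regression)
lookup regression
```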
Variables and Cases

STATA uses data organized in rows and columns. Rows correspond to observations and columns to variables. An observation (or case) contains information for one unit of analysis, such as a person, an animal, a business, or a jet engine. Variables are the information collected for each case, such as age, body weight, profits, or fuel consumption. For example, the following data is presented in a STATA editor spreadsheet:

    Make          Price   MPG   Weight   Gear Ratio
    VW Rabbit      4697    25     1930         3.78
    Olds 98        8814    21     4060         2.41
    AMC Concord    4099    22     2930         3.58

Here, each row holds the information for an individual car. Each column is a different variable, and its value in a particular row depends on the individual car. STATA accepts either numbers or characters as data. Each column can contain numbers or characters but not both. For example, the variable Make contains characters that name cars, while the variable Price contains numbers that give each car's price.

Manipulating Data with STATA

STATA can read data from a text (ASCII) file, such as one exported from a spreadsheet or one in which the values are separated by spaces or laid out in some other format. Once you read the data into STATA and save it, the new file will be in a STATA-specific format.

Importing Data Stored in Spreadsheet Form

There are a few things you need to know when you do your data entry in the STATA editor:

  • A period “.” represents a missing numeric value.
  • A variable name must be 1 to 8 characters long. The first character of a variable name must be a letter or an underscore.
  • STATA will not allow empty columns or rows in the middle of your data set.

Opening a File
To open a file in Stata, choose File/Open to bring up the dialog box. Browse the drives and directories to find the folder that contains your data file. Finally, set the file type to “*.dta” (the format in which STATA saves data) and select the file.

Viewing, Editing, and Saving the STATA Data File

The Data Editor window is the place where you can edit data values and type in new data.

Inputting data

In the Stata Editor window, you can navigate by clicking on a cell or by using the arrow keys, Tab, and Enter. The environment behaves much like a standard spreadsheet program. You can input data variable by variable by pressing Enter after each value, or observation by observation by pressing Tab after each value. You can change a data value by clicking the cell, typing the new value, and pressing Enter. A missing value can be entered by pressing Tab or Enter on an empty cell or by typing a period.

To rename a variable, double-click anywhere in the variable's column, which brings up the Variable Information dialog. Then enter the new name of the variable.

An important point to note is that the display format of each variable is set automatically when you input data, and numerical data and characters have different formats. If a column was first entered as characters, typing numbers into it later does not change its format, so the variable still cannot be used as a number in statistical analyses.

To input data from spreadsheets, text files, and other files, you can use the commands insheet, infile, and infix, respectively. For example, if you have a text file data set, type infile varlist using filename in the Stata Command window.

Labeling Data

The list command presents the data set in the Results window and can be used in one of the following ways: you can list one or more variables; you can list certain observations, for example by typing list in 1/3 to list observations one through three; or you can list the observations that satisfy a condition with "if."
The describe command provides information about the data set: the number of observations and variables, the size, and the data formats.
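The reading, listing, and describing commands above can be tried together. The following is a minimal sketch, assuming a copy of the auto data used later in this handout is available as auto.dta in the working directory and that its variables are named make, price, and mpg. The file cars.raw and the variable list on the infile line are hypothetical and are shown only as a comment.

```stata
* Hypothetical: read a space-separated text file into memory
* (string values that contain spaces must be quoted in the file)
* infile str18 make price mpg using cars.raw

* Assumption: auto.dta is in the working directory
use auto, clear

* List selected variables for observations 1 through 3
list make price mpg in 1/3

* List only the observations that satisfy a condition
list make mpg if mpg > 30

* Report the number of observations and variables,
* and each variable's storage type and display format
describe
```

In later Stata releases the shipped copy of the data can be loaded with sysuse auto, clear instead of the use line.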
The label define command is used to create a value label, and label values is used to attach a value label to a variable. The syntax for these two commands is:

    label define labelname # "contents" # "contents"
    label values variablename labelname

Changing and viewing data

If you want to change a value, click on the cell you want to change and enter the new value. You can select what appears in the Editor window by typing edit varname, edit if..., or edit in 1/5 (to show observations 1 through 5).

Deleting Data

There are two ways to delete data:

  1. Click on the Delete button in the editor.
  2. In the Command window, type drop varname to delete a variable, or drop in followed by a range of observation numbers (for example, drop in 1/5) to delete rows.

Creating new variables

To create a new variable that is an algebraic expression of another variable, type generate newvar = expression. For example, to create the logarithm of the auto price, type gen logpr=ln(price). In addition, the replace command allows you to change the contents of an existing variable.

Saving Your Worksheet

After you finish editing, exit the Editor window, pull down the File menu, and choose Save As. Alternatively, type save filename in the Command window.

Exploring the Data

As we will see, all of the different types of statistical analyses available in STATA are located under the main menu option “Statistical Analysis.” In this example, we are going to obtain some basic descriptive characteristics of the data using the STATA summarize command.

After you load the data file that you are going to analyze, type summarize in the Command window. This produces a summary of every variable in the data: variable name, number of observations, mean, standard deviation, minimum, and maximum. You can restrict the summary to the variables you want by typing their names after the summarize command.

The command tabulate varname produces a frequency table showing the frequency, percentage, and cumulative percentage of each value.
The command correlate varlist produces the correlations among the variables that you choose.
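The labeling, variable-creation, and descriptive commands from the last few sections can be strung together in one short session. This is a sketch under stated assumptions rather than a record of the class: it assumes auto.dta is in the working directory and that foreign is coded 0 for domestic and 1 for foreign cars. The names origlbl and pricek are hypothetical and chosen only for illustration.

```stata
* Assumption: auto.dta is in the working directory
use auto, clear

* Create a value label and attach it to the 0/1 variable foreign
* (origlbl is a hypothetical label name)
label define origlbl 0 "Domestic" 1 "Foreign"
label values foreign origlbl

* Create a new variable: the natural log of price
generate logpr = ln(price)

* replace changes an existing variable; here a copy of price
* (pricek, hypothetical) is rescaled to thousands
generate pricek = price
replace pricek = price/1000

* Summary of every variable, then of two chosen variables
summarize
summarize mpg weight

* Frequency, percent, and cumulative percent for each value of foreign
tabulate foreign

* Correlations among the chosen variables
correlate mpg weight price

* Remove the illustrative variable again
drop pricek
```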
Graphing Data

Generating a scatterplot is often used as an intermediate step to check the data for anomalies or data entry mistakes. Because a graph lets you visually inspect all of the data at once, a scatterplot allows you to quickly check for trends, outliers, and so on.

After you type graph varnames, the Graph window appears, probably covering up the Results window. You can also create separate graphs in a Graph window. The plot below describes the relationship between MPG and weight, using the auto data set that comes with the tutorial files.

[Figure: scatterplot of MPG against weight]

To save the graph to disk, choose Save Graph from the File menu and enter the filename. To load the graph from disk, type graph using filename.

Performing a Test

Running a T-test

To conduct a t-test, use the ttest command with the variable you want to test. This produces a table with the means, standard deviations, t statistic, and sample sizes, which helps you decide whether or not to reject the null hypothesis. Using the auto data as an example, we test the hypothesis that the average MPG of domestic cars and the average MPG of foreign cars are equal. By typing ttest mpg, by(foreign), we obtain the following information:

    Variable |   Obs       Mean   Std. Dev.
    ---------+------------------------------
    Domestic |    52   19.82692    4.743297
     Foreign |    22   24.77273    6.611187
    ---------+------------------------------
    Combined |    74    21.2973    5.785503

    Ho: mean(x) = mean(y) (assuming equal variances)
        t = -3.63 with 72 d.f.
        Pr > |t| = 0.0005

The probability value is 0.0005, which means that there is a 0.05% chance of obtaining sample means at least this far apart if the null hypothesis is true. Since this is a very small probability, we can reject the null hypothesis of equal means and conclude that the average MPG of domestic cars tends to be lower than that of foreign cars.

Dealing with Output

You can save and print your output in a log file. The log file is a plain ASCII (text) file that can be printed from Stata or loaded into a text editor or word processor.

  1. To start a log file, click on the Log... button on the menu bar and fill in a name for the file. This opens a standard file dialog box that lets you specify a drive or directory (e.g., c: or d:) and a filename to hold your log.
  2. Open the log window by choosing Log from the Window menu or by clicking Log... and choosing Bring log window to top.
  3. The log window looks exactly like the output you saw in the Results window.
  4. To print an open log file during a Stata session, pull down the File menu and choose Print Log.
  5. Then choose Close the Log file in Log...; the output is saved automatically.
  6. You can also append Stata results to an existing log file or overwrite an existing log file.

Running a Simple Linear Regression

The Concept of Linear Regression

The following plot shows the relationship between the weight and the mileage of a car.

[Figure: plot of mileage against weight]

When we do a linear regression, we hypothesize the existence of a linear relationship between two variables (call x the independent variable and y the dependent variable) of the form y = α + βx + ε, where α and β are constants and ε is an error term.
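The t-test, the log file, and the simple linear model y = α + βx + ε described above can also be driven entirely from the Command window. The following is a minimal sketch, assuming auto.dta is in the working directory; the log file name session1.log is hypothetical.

```stata
* Assumption: auto.dta is in the working directory
use auto, clear

* Start a log file (session1.log is a hypothetical name);
* replace overwrites an existing file, append would add to it
log using session1.log, replace

* Test whether mean mpg is equal for domestic and foreign cars
ttest mpg, by(foreign)

* Simple linear regression with y = mpg and x = weight:
* the estimate of beta is reported on the weight row,
* the estimate of alpha on the _cons row
regress mpg weight

* Close the log file so the output is saved
log close
```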
Regression with STATA

Using the auto data, we can model the relationship between MPG and weight. Based on the above graph, we judge the relationship to be nonlinear and model MPG as a quadratic in weight. Thus, we estimate the model:

    mpg = β0 + β1 weight + β2 weight^2 + ε    (1)

First, create a new variable "weight2" equal to the square of weight by using the generate command, and then type regress mpg weight weight2. The dependent variable must directly follow the regress command. The regression results are as follows (the squared term is labeled wtsq in the table):

      Source |       SS        df       MS          Number of obs =      74
    ---------+-------------------------------       F(  2,    71) =   72.80
       Model |  1642.52197      2  821.260986       Prob > F      =  0.0000
    Residual |  800.937487     71  11.2808097       R-squared     =  0.6722
    ---------+-------------------------------       Adj R-squared =  0.6630
       Total |  2443.45946     73  33.4720474       Root MSE      =  3.3587

    --------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.        t    P>|t|    [95% Conf. Interval]
    ---------+----------------------------------------------------------------
      weight |  -.0141581   0.0038835   -3.646    0.001   -.0219016   -.0064145
       wtsq* |   1.32e-06    6.26e-07    2.116    0.038    7.67e-08    2.57e-06
       _cons |   51.18308    5.767884    8.874    0.000    39.68225    62.68392
    --------------------------------------------------------------------------
    * wtsq = weight^2

The column labeled “t” gives the t statistic for each coefficient, and the column labeled “P>|t|” gives its two-sided p-value: the probability of obtaining a t statistic at least this large in absolute value if the true coefficient were zero. Since we want to test the null hypotheses that the coefficients on “weight” and “wtsq” are zero, we look at their p-values, which are 0.001 and 0.038, respectively. Both are below 0.05, so we can reject each null hypothesis at the 5% level. We infer that mpg depends on weight through both the linear term and the squared term; that is, the relationship between mpg and weight is nonlinear.

Based on the above regression results, we rewrite equation (1) as follows:

    mpg = 51.18308 - .0141581 weight + 1.32e-06 wtsq

As you can see here, mpg, our dependent variable, “depends” on the values of weight and wtsq. From this equation, we can predict an expected value for mpg if we are given values of weight and wtsq. For example, given a weight of 2500, we predict that the car has an mpg of

    51.18308 - 2500*0.0141581 + 2500^2*1.32e-06 ≈ 24

Consider a car weighing 3000. This car's mpg is predicted to be

    51.18308 - 3000*0.0141581 + 3000^2*1.32e-06 ≈ 21

which is lower than 24.

Plotting the Results

We can now graph the data and the predicted curve by typing graph mpg mpghat weight, where mpghat holds the predicted values of mpg (created, for example, by typing predict mpghat after the regression).
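The regression, the two hand predictions, and the variable mpghat used in the plotting command can be reproduced with a few commands. This is a sketch under stated assumptions: auto.dta is in the working directory, the squared term is named weight2 as in the text (the table labels it wtsq), and mpghat is created with predict, which the handout does not show explicitly.

```stata
* Assumption: auto.dta is in the working directory
use auto, clear

* Squared weight for the quadratic model in equation (1)
generate weight2 = weight^2

* The dependent variable comes first, then the regressors
regress mpg weight weight2

* Fitted values from the regression just run
predict mpghat

* Hand predictions using the rounded coefficients from the table
* weight = 2500 gives about 24.04
display 51.18308 - 0.0141581*2500 + 1.32e-06*2500^2
* weight = 3000 gives about 20.59
display 51.18308 - 0.0141581*3000 + 1.32e-06*3000^2

* The same prediction from the stored, unrounded coefficients
display _b[_cons] + _b[weight]*2500 + _b[weight2]*2500^2

* Plot the data and the fitted curve, in the form this handout uses
* (assumption: Release 5.0 syntax; in later releases try
*  twoway (scatter mpg weight) (line mpghat weight, sort))
graph mpg mpghat weight
```

Because the hand calculation uses coefficients rounded as printed in the table, it can differ slightly from the value computed from the stored coefficients.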
[Figure: mpg and predicted mpg plotted against weight]

Notice how the data follows a general trend: higher mpg at lower weights. The line we have found is the “best fit” for this data set. Note that this line is a curve, which indicates a nonlinear relationship between mpg and weight.

To save a graph, choose Save Graph in the File menu. A dialog box opens to let you save the graph to a file with the extension “.gph”. Similarly, click on Print Graph in the File menu to print the graph from STATA.

Exiting STATA

Once you have completed your session and saved all of the relevant windows, you can exit STATA by clicking on Exit in the File menu.