How to Use STATA for Windows
Wei Sun
Introduction
This class is designed to acquaint you with the basic use of STATA for
Windows. We will learn the specifics of two types of statistical tests in
this class. Familiarity with IBM PCs and Windows and a basic knowledge of
statistics are required, but it is assumed that you have no prior
experience with other versions of STATA.
What is STATA?
STATA is a statistical software package for managing, analyzing, and
graphing data. STATA for Windows, Release 5.0, is the most recent
Windows-based version of STATA; versions for Macintosh, DOS, and UNIX
are also available.
STATA for Windows is a very powerful program that can perform a wide
variety of statistical techniques, such as the t-test, analysis of variance,
factor analysis, and multivariate regression.
Overview
In the course of this class, we will be using STATA to do the following
tasks:
Exploring the Data
The data is loaded into memory and viewed using the Data Editor.
Descriptive statistics are produced in a table by typing simple
commands.
Conducting a T-Test
A t-test can be run in STATA on two of the variables to test whether
the two samples came from populations with the same mean.
Running a Regression
A simple linear regression is run in STATA to test whether a linear
relationship exists between two of the variables. Generating plots using
STATA will also be introduced.
Selecting Data
Some of the data is removed from consideration, and another linear
regression is run.
Why Use STATA?
The calculations required for even basic statistics can be very cumbersome
and tedious to do manually or even with a programmable calculator.
Many advanced statistical techniques are practically impossible to do
without using a computer. A statistical package does for data analysis
what a word-processing package does for writing. Statistical packages
can sometimes help you work more quickly, but the primary effect is on
the quality of your work: you will get much more information out of your
data, and your analyses will be more accurate and efficient than if you
did them manually. STATA is a general-purpose statistical package for
researchers in all disciplines. It is available for Windows, Macintosh,
DOS, and UNIX, which makes it widely accessible. With only a handful of
basic commands, a researcher can use STATA for data management and
programming.
Getting Started with STATA
Follow this path to access the Stata program:
Programs/Basic/Statistics and Math/Stata 5.0
Getting Familiar with STATA for Windows
STATA has four windows: Stata Command, Review, Variables, and Stata
Results. The Stata Command window is where you type the commands.
Past commands appear in the Review window and the variable list appears
in the Variables window. Output appears in the Stata Results window. You
can open the Data Editor by clicking the Editor button on the toolbar.
The Editor works like a spreadsheet and lets you enter data.
Finding Help
STATA for Windows has an extensive help system that can help you
quickly find information about most STATA commands. The help system can
be accessed by choosing Help from the main menu bar or by entering
commands in the Command window. In the help system, you can choose
between searching by topic with lookup topic and getting help for a
specific Stata command. Help files for Stata commands contain the
command's syntax, a description of the command, its options, examples,
and references.
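The two routes into the help system can be sketched from the Command window as follows. The topic word and the command name are only examples, and lookup is the Stata 5 command named above (later releases are understood to use search for the same purpose).

```stata
* Show the help file for one specific command (here: ttest)
help ttest
* Search the help system by topic (Stata 5 syntax, as described in the text)
lookup regression
```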
Variables and Cases
STATA uses data organized in rows and columns. Rows correspond to
observations and columns to variables. An observation contains
information for one unit of analysis, such as a person, an animal, a
business, or a jet engine. Variables are the information collected for each
case, such as age, body weight, profits, or fuel consumption.
For example, the following data is presented in a STATA Editor
spreadsheet:
Make          Price   MPG   Weight   Gear Ratio
VW Rabbit      4697    25     1930         3.78
Olds 98        8814    21     4060         2.41
AMC Concord    4099    22     2930         3.58
Here, each row represents the information of an individual car. Each
column is a different variable, and its value in a particular row depends on
the individual car.
STATA accepts either numbers or characters as data. Each column can
contain numbers or characters but not both. For example, the variable
Make contains characters that name cars, while the variable Price
contains numbers that describe the car's price.
Manipulating Data with STATA
STATA can read data from a text (ASCII) file, whether it was exported
from a spreadsheet, is separated by spaces, or is in another format. Once
you read the data into STATA and save it, the new file will be in a
STATA-specific format.
Importing Data Stored in Spreadsheet Form
There are a few things you need to know when you are doing your data
entry in the STATA Editor:
• A period “.” represents a missing numeric value.
• A variable name must be 1 to 8 characters long. The first character of a
variable name must be a letter or an underscore.
• STATA will not allow empty columns or rows in the middle of your
data set.
Opening a File
To open a file in Stata, choose File/Open to bring up the dialog box.
Browse the drives and directories to find the folder that contains your
desired data file. Finally, set the file type to “*.dta” and open the
file.
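The same thing can be done from the Command window with the use command. This is a minimal sketch that assumes the tutorial's auto data has been saved as auto.dta in the current working directory; substitute your own file name.

```stata
* Load a Stata-format data file (assumed here to be auto.dta in the working directory)
use auto
* If a data set is already in memory, discard it and load the file anyway
use auto, clear
```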
Viewing, Editing, and Saving the STATA Data File
The Data Editor window is the place where you can edit data values and
type in new data.
Inputting data
In the Stata Editor window, you can navigate by clicking on a cell or by
using the arrow keys, Tab, and Enter. The environment behaves much
like a standard spreadsheet program. You can input data variable-by-
variable by using Enter after each value, or you can input data
observation-by-observation by using Tab after each value.
You can change a data value by clicking the cell, typing the new value,
and pressing Enter. A missing value can be entered by pressing Tab or
Enter without typing a value, or by typing a period. To rename a
variable, double-click anywhere in the variable’s column, which brings up
the Variable Information dialog, and then enter the new name.
An important point to note is that the display format of each variable is
set automatically when you input data. Numeric data and character
(string) data have different formats. If you replace a number that was
entered as a string with a numeric value and then try to do statistical
analyses, the variable's format will not change correspondingly.
To input data from files, you can use the commands insheet (for
spreadsheet files saved as text), infile (for text files), or infix (for
fixed-format text files). For example, if you have a text file data set,
type infile varlist using filename in the Stata Command window.
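As a rough illustration of the first two commands, the sketch below reads a hypothetical spreadsheet export and a hypothetical free-format text file. The file names cars.csv and cars.raw and the variable name gratio are assumptions made for this example, not files supplied with the class.

```stata
* Spreadsheet saved as comma- or tab-delimited text (hypothetical file cars.csv)
insheet using cars.csv, clear
* Free-format text file (hypothetical file cars.raw); variables are listed in file order
* str18 declares make as a string of up to 18 characters
* Text values that contain spaces, such as "VW Rabbit", must be quoted in the file
infile str18 make price mpg weight gratio using cars.raw, clear
```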
Labeling Data
The list command displays the data in the Results window and can be used
in one of the following ways: you can list one or more variables; you
can list a range of observations, for example by typing list in 1/3 to
list observations one through three; or you can list the observations
that meet a condition with "if."
The describe command provides information about the data set: number
of observations and variables, size, and data formats.
The label define command creates a value label, and label values attaches
a value label to a variable. The syntax for these two commands is:
label define labelname # "contents" # "contents"
label values variablename labelname
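Put together, the listing and labeling commands above might be used as in the following sketch. It assumes the tutorial's auto data is available as auto.dta and that foreign is coded 0 for domestic and 1 for foreign cars; the label name cartype is hypothetical.

```stata
use auto, clear
* List two variables for observations 1 through 3
list make mpg in 1/3
* List only the observations that satisfy a condition
list make mpg if foreign == 1
* Create a value label (hypothetical name cartype) and attach it to foreign
label define cartype 0 "Domestic" 1 "Foreign"
label values foreign cartype
* Show the number of observations and variables, storage types, and formats
describe
```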
Changing and viewing data
If you want to change the value, just click on the cell you want to change
and enter a new value.
You can restrict what appears in the Editor window by typing edit varname
to show selected variables, edit if... to show the observations that meet
a condition, or edit in 1/5 to show observations 1 through 5.
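For instance, with the auto data loaded, the three forms could look like this. The variable names come from the example data, and the condition is only an illustration.

```stata
* Open the Editor showing only make and price
edit make price
* Open the Editor showing only the foreign cars
edit if foreign == 1
* Open the Editor showing only observations 1 through 5
edit in 1/5
```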
Deleting Data
There are two ways to delete data:
1. Click on the Delete button in the Editor.
2. In the Command window, type drop varname to delete a variable, or
drop in # to delete observations by row number.
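A short sketch of the command-line route, assuming the auto data is in memory. Deletions made this way affect only the copy in memory until the file is saved again.

```stata
use auto, clear
* Delete one variable (a column)
drop weight
* Delete observations 1 through 3 (rows)
drop in 1/3
* Delete every observation that meets a condition
drop if mpg == .
```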
Creating new variables
To create a new variable that is an algebraic expression of another
variable, type generate newvar = expression. For example, to create the
natural logarithm of auto price, type gen logpr=ln(price). In addition,
the replace command allows you to change the contents of an existing
variable.
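The difference between the two commands can be sketched as follows: generate creates a variable that does not yet exist, while replace overwrites one that does. The name pricek is hypothetical and chosen only for this example.

```stata
use auto, clear
* New variable: natural logarithm of price
generate logpr = ln(price)
* New variable (hypothetical name pricek), first a copy of price
generate pricek = price
* Overwrite its contents so that it holds price in thousands
replace pricek = price/1000
```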
Saving Your Worksheet
After you finish editing, exit the Editor window, pull down the File menu,
and choose Save As. Or you can type save filename in the Command
window.
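From the Command window, saving might look like the sketch below. The file name mycars is hypothetical; Stata adds the .dta extension itself.

```stata
* Save the data in memory as mycars.dta (hypothetical name)
save mycars
* After further edits, overwrite the existing file
save mycars, replace
```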
Exploring the Data
As we will see, all of the different types of statistical analyses available in
STATA are located under the main menu option “Statistical Analysis.” In
this example, we are going to obtain some basic descriptive characteristics
of the data using STATA's descriptive commands.
After you load the data file that you are going to analyze, type summarize
in the Command window. This produces a summary of every variable:
variable name, number of observations, mean, standard deviation,
minimum, and maximum. To summarize only certain variables, type their
names after the summarize command.
The command tabulate varname produces a frequency table showing the
frequency, percentage, and cumulative percentage of each value. The
command correlate varlist produces the correlations among the variables
that you list.
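Applied to the auto data, the three descriptive commands could be run as in this sketch, which assumes auto.dta is in the working directory.

```stata
use auto, clear
* Summary statistics for every variable
summarize
* Summary statistics for selected variables only
summarize mpg weight
* Frequency table for a categorical variable
tabulate foreign
* Correlation matrix for the listed variables
correlate mpg weight price
```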
Graphing Data
Generating a scatterplot is often used as an intermediate step to check the
data for anomalies or data entry mistakes. Because a graph provides you
with a way to visually inspect all of the data at once, a scatterplot allows
you to quickly check for trends, outliers, etc.
After you type graph varnames, the Graph window appears, probably
covering up the Results window. You can also create separate graphs in a
Graph window. Here is the plot in STATA describing the relationship
between MPG and weight, using the auto data set from the tutorial
files:
To save the graph to disk, choose Save Graph from the File menu and
enter the filename. To load the graph from disk, type graph using
filename.
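The plotting and saving steps can be sketched in the Stata 5 syntax used in this class. The graph file name mpgwt is hypothetical. Later Stata releases changed the graph syntax, so there the scatterplot would likely be typed as scatter mpg weight instead.

```stata
use auto, clear
* Scatterplot: mpg on the vertical axis, weight (the last variable) on the horizontal axis
graph mpg weight
* Draw the same plot and save it to mpgwt.gph (hypothetical name)
graph mpg weight, saving(mpgwt)
* Redisplay the saved graph from disk
graph using mpgwt
```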
Performing a Test
Running a T-test
To conduct a two-sample t-test, type ttest varname, by(groupvar). This
will produce a table with the sample sizes, means, standard deviations,
and t-statistic, which helps to determine whether or not to reject the
null hypothesis.
Using the example of auto data, we test the hypothesis that the average
MPG of domestic and foreign cars is equal. By typing ttest mpg,
by(foreign), we obtain the following information:
Variable |   Obs       Mean    Std. Dev.
---------+------------------------------
Domestic |    52   19.82692    4.743297
 Foreign |    22   24.77273    6.611187
---------+------------------------------
Combined |    74    21.2973    5.785503
Ho: mean(x) = mean(y) (assuming equal variances)
t = -3.63 with 72 d.f.
Pr > |t| = 0.0005
The probability value is 0.0005, which means that there is a 0.05%
chance of obtaining sample means at least this far apart if the null
hypothesis is true. Since this is a very small probability, we can reject
the null hypothesis of equal means and conclude that the average MPG of
domestic cars tends to be lower than that of foreign cars.
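To see where the t-statistic comes from, the sketch below runs the test and then repeats the equal-variance calculation by hand from the means, standard deviations, and sample sizes in the table. The scalar names sp and se are chosen for this example only.

```stata
use auto, clear
ttest mpg, by(foreign)
* Pooled standard deviation from the two group standard deviations (72 = 52 + 22 - 2)
scalar sp = sqrt((51*4.743297^2 + 21*6.611187^2)/72)
* Standard error of the difference between the two means
scalar se = sp*sqrt(1/52 + 1/22)
* t-statistic: difference in means divided by its standard error (about -3.63)
display (19.82692 - 24.77273)/se
```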
Dealing with Output
You can save and print your output in a log file. The log file is a plain
ASCII (text) file that can be printed from Stata or loaded into a text editor
or word processor.
1. To start a log file, click on the Log... button and fill in a name for
the file. This will open a standard file dialog box allowing you to
specify a directory (e.g., c: or d:) and filename to hold your log.
2. Open the log window by choosing Log from the Window menu or
clicking Log... and choosing Bring log window to top.
3. The log window shows exactly the output you saw in the Results
window.
4. To print an open log file during a Stata session, pull down the File
menu and choose Print Log.
5. When you are finished, click Log... and choose Close log file; the
output is saved automatically.
6. You can also append Stata results onto an existing log file or
overwrite the existing log file.
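The same steps can be typed in the Command window with the log command. This is a sketch only; the log name session1 is hypothetical.

```stata
use auto, clear
* Start recording output in a log file (hypothetical name session1)
log using session1
summarize mpg weight
* Stop recording; the log is saved when it is closed
log close
* In a later session, add new output to the end of the same log
log using session1, append
log close
* Or start the log over, overwriting the existing file
log using session1, replace
log close
```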
Running a Simple Linear Regression
The Concept of Linear Regression
The following plot shows the relationship between the weight and mileage
of the car. When we do a linear regression, we hypothesize the existence
of a linear relationship between two variables (call x the independent
variable and y the dependent variable) of the form y = α + βx + ε,
where α and β are constants and ε is an error term.
Regression with STATA
Using the auto data, we can model the relationship between MPG and
weight. Based on the above graph, we judge the relationship to be
nonlinear and model MPG as a quadratic in weight. Thus, we estimate the
model:
mpg = β0 + β1 weight + β2 weight^2 + ε    (1)
First, create a new variable, wtsq, equal to the square of weight by
using the generate command, and then type regress mpg weight wtsq. The
dependent variable must directly follow the regress command.
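The two steps just described can be sketched as follows, assuming the auto data is available as auto.dta. The squared term is named wtsq to match the output table.

```stata
use auto, clear
* Squared term for the quadratic model
generate wtsq = weight^2
* Dependent variable first, then the regressors
regress mpg weight wtsq
```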
The following are the linear regression results:
  Source |       SS       df       MS                  Number of obs =      74
---------+------------------------------               F(  2,    71) =   72.80
   Model |  1642.52197     2  821.260986               Prob > F      =  0.0000
Residual |  800.937487    71  11.2808097               R-squared     =  0.6722
---------+------------------------------               Adj R-squared =  0.6630
   Total |  2443.45946    73  33.4720474               Root MSE      =  3.3587
------------------------------------------------------------------------------
     mpg |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  weight |  -.0141581   0.0038835    -3.646   0.001      -.0219016   -.0064145
   wtsq* |   1.32e-06    6.26e-07     2.116   0.038       7.67e-08    2.57e-06
   _cons |   51.18308    5.767884     8.874   0.000       39.68225    62.68392
------------------------------------------------------------------------------
* wtsq=weight^2
The column labeled “t” gives the t-statistic and the column labeled
“P>|t|” gives the two-sided probability value for each of the variables
that you regress mpg against, testing the null hypothesis that the
variable's coefficient is zero. Since we want to test the null hypotheses
for “weight” and “wtsq,” we look at the values for them. The probability
values are 0.001 and 0.038, respectively, which means that coefficient
estimates this far from zero would be unlikely if the true coefficients
were zero. Thus, at the 5% significance level, we can reject both null
hypotheses and infer that mpg is related to weight and that the
relationship is curved, because the coefficient on the squared term wtsq
is also different from zero.
Based on the above regression results, we rewrite equation (1) as
follows:
mpg = 51.18308 - .0141581 weight + 1.32e-06 wtsq
As you can see here, mpg, our dependent variable, “depends” on the values
of weight and wtsq. From this equation, we can predict an expected
value for mpg if we are given a value of weight, since wtsq is simply
its square. For example, given a weight of 2500, we can predict that this
car has an mpg of
51.18308 - 2500*0.0141581 + 2500^2*1.32e-06 = 24.04, or about 24.
Consider a car weighing 3000. This car’s mpg is predicted to be
51.18308 - 3000*0.0141581 + 3000^2*1.32e-06 = 20.59, or about 21,
which is lower than the 24 predicted for the lighter car.
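These predictions can be checked with the display command, which works as a calculator in the Command window. The coefficients are typed in from the regression table, so the results are rounded in the same way as the table.

```stata
* Predicted mpg for a car weighing 2500 (about 24.04)
display 51.18308 - 2500*.0141581 + 2500^2*1.32e-06
* Predicted mpg for a car weighing 3000 (about 20.59)
display 51.18308 - 3000*.0141581 + 3000^2*1.32e-06
```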
Plotting the Results
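We can now graph the data and the predicted curve. First type predict
mpghat to store the fitted values from the regression in a new variable,
mpghat, and then type graph mpg mpghat weight: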
[Figure: mpg and predicted mpg (mpghat) plotted against weight]
Notice how the data follow a general trend: higher weights go with lower
mpg. The line we have found is the “best fit” for this data set. Because
the model includes the squared term, this line is a curve, which reflects
the nonlinear relationship between mpg and weight.
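The full sequence from regression to plot might look like the sketch below, in the Stata 5 graph syntax used in this class. It assumes auto.dta is in the working directory; mpghat is the name given to the fitted values.

```stata
use auto, clear
generate wtsq = weight^2
regress mpg weight wtsq
* Store the fitted values from the regression just run in a new variable
predict mpghat
* Plot actual and fitted mpg against weight
graph mpg mpghat weight
```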
To save a graph, choose Save Graph from the File menu. A dialog box will
open to allow you to save the graph to a file with the extension .gph.
Similarly, choose Print Graph from the File menu to print the graph.
Exiting STATA
Once you have completed your session and saved all of the relevant
windows, you can exit STATA by clicking on Exit from the File menu.