2. What is Stata?
Stata is a general-purpose, integrated statistical software package first released in 1985 and developed by StataCorp.
It enables users to manage and analyze data and to produce graphical visualizations.
It provides commands for statistical tests and econometric analysis, including panel data analysis (cross-sectional time-series, longitudinal, repeated-measures), cross-sectional data, time-series, survival-time data, and cohort analysis.
Stata can be used either through drop-down menus or by typing commands.
It is user friendly and has an extensive library of tools, along with internet capabilities for installing new features and updating regularly.
3. Versions of Stata
There are three versions of Stata.
All three versions are available for 32-bit and 64-bit computers. The major differences among them are the number of observations and variables they can handle and their data processing speed.
1) Stata/IC (or Intercooled Stata): handles up to 2,047 variables.
2) Stata/SE (Special Edition): handles up to 32,767 variables (and also allows longer string variables and larger matrices).
3) Stata/MP (Multicore/Multiprocessor): handles at least as many variables as Stata/SE (up to 120,000 from Stata 15 onward) and is substantially faster on multicore computers.
5. [Screenshot labels] Open a .dta file; Save and print the file; Log file; .do file; Data Editor; Variable Manager
6. Do-file setup
* Do-files: Stata do-files are text files in which users store and run their commands for reuse, rather than retyping the commands into the Command window.
◦ Do-files are used for reproducibility and for easier debugging and editing of commands.
◦ The file extension .do is used for do-files.
◦ The doedit (doe) command opens the do-file editor.
* Stata 16 features an enhanced editor with tab auto-completion for Stata commands and previously typed words.
* To run a command from the do-file, highlight part or all of the command, and then press Ctrl-D (Mac: Shift+Cmd+D) or click the "Execute (do)" icon, the rightmost icon on the do-file editor toolbar.
* Multiple commands can be selected and executed together.
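A minimal sketch of what a small do-file might contain. It assumes the auto example dataset that ships with Stata; the log file name is hypothetical and not part of the slides.

```stata
* A small, reproducible script saved as a .do file
clear all                                // start from an empty session
capture log close                        // close any log that was left open
log using "example_log", replace text    // hypothetical log file name
sysuse auto, clear                       // example dataset shipped with Stata
summarize price mpg                      // commands run in order, top to bottom
log close
```

Saving commands this way means the whole analysis can be rerun by executing the file instead of retyping each command.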
7. Syntax highlighting
The do-file editor colors Stata commands blue.
Comments, which are not executed, are usually preceded by * and are colored green.
Words in quotes (file names, string values) are colored red.
9. Comments
Comments are not executed, so they provide a way to document the do-file.
Comments are either preceded by * or surrounded by /* and */.
Comments appear in green in the do-file editor.
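The comment styles described above might look like this in a do-file. The sketch assumes the auto example dataset shipped with Stata; the // style on the last line is not mentioned in the slides but is also accepted in do-files.

```stata
* This whole line is a comment and is not executed
sysuse auto, clear      /* a comment placed after a command */
/* A block comment
   can span several lines */
summarize price         // two slashes also start a comment in do-files
```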
10. Long lines in do-files
Stata normally assumes that a newline signifies the end of a command.
You can extend a command over multiple lines by placing /// at the end of each line except the last.
Make sure to put a space before ///.
When executing, highlight every line of the command(s).
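As an illustration of line continuation, the sketch below splits one command over three lines. It assumes the auto example dataset shipped with Stata.

```stata
sysuse auto, clear
* One command written across three lines; note the space before each ///
summarize price mpg weight ///
    if foreign == 0 ///
    , detail
```

All three lines must be highlighted together when running this from the do-file editor, because Stata reads them as a single command.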
11. Rules for naming a variable
A) Letters: English letters in upper or lower case (variable names, like commands, are case sensitive).
B) Digits: any digit from 0 to 9 can be used, but the first character cannot be a digit.
C) Symbol: the underscore (_) is the only symbol allowed.
D) Length: the name can have up to 32 characters.
Correct: age, AGE, age_1, age3, age32, AGE_1
Incorrect: 1age, 1AGE, age@, age@1, 3_age, age&3
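One way to see these naming rules in action is to try creating variables and check whether Stata accepts them. This is a small sketch on an empty three-row dataset; the values 30 to 60 are arbitrary.

```stata
clear
set obs 3                                // empty dataset with 3 observations
generate age_1 = 30                      // valid: letters, a digit, an underscore
generate AGE_1 = 40                      // valid, and distinct from age_1 (case sensitive)
capture generate 1age = 50               // invalid: starts with a digit
display "return code for 1age: " _rc     // a nonzero code means the command failed
capture generate age@ = 60               // invalid: @ is not allowed
display "return code for age@: " _rc
```

The capture prefix keeps the do-file running after an error and stores the error code in _rc.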
12. Importing data
use            load Stata dataset
save           save Stata dataset
clear          clear dataset from memory
import excel   import Excel dataset
Using the drop-down menu: File -> Import -> Excel spreadsheet
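The commands above could be combined as in the following sketch. The file names auto_copy.dta and auto_copy.xlsx are hypothetical, and the export excel step is an addition (not in the slides) so that the import step has a file to read.

```stata
sysuse auto, clear                       // load an example dataset shipped with Stata
save "auto_copy.dta", replace            // save it as a Stata dataset
clear                                    // remove the data from memory
use "auto_copy.dta", clear               // load the saved dataset again

* Write an Excel file first so that the import step has something to read
export excel using "auto_copy.xlsx", firstrow(variables) replace
import excel using "auto_copy.xlsx", firstrow clear
describe
```

The firstrow option tells import excel to treat the first spreadsheet row as variable names.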
13. Viewing data
browse   open spreadsheet of data
list     print data to Stata console
Once the data are loaded, we can view the dataset as a spreadsheet using the command browse.
The magnifying glass with spreadsheet icon also browses the dataset.
In the browser, black columns are numeric, red columns are strings, and blue columns are numeric with string labels.
14. Operators and Expressions
These are the key arithmetic, logical and relational operators to keep in mind:
Arithmetic                 Logical            Relational
+  add                     !  not (also ~)    == equal
-  subtract                |  or              != not equal (also ~=)
*  multiply                &  and             <  less than
/  divide                                     <= less than or equal
^  raise to power                             >  greater than
+  string concatenation                       >= greater than or equal
Use the display command to use Stata as a calculator.
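A few calculator-style uses of display with the operators listed above. The numbers are arbitrary, and the trailing comments show the value each line prints.

```stata
display 2 + 3 * 4                  // 14
display 2^10                       // 1024
display 7 / 2                      // 3.5
display "Sta" + "ta"               // Stata
display (5 > 3) & !(2 == 2)        // 0 (false)
display (1 != 2) | (3 <= 2)        // 1 (true)
```

Logical and relational expressions evaluate to 1 when true and 0 when false.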
15. Selecting observations
Many commands are run on a subset of the observations in the dataset.
in: select by observation number
in selects by observation (row) number.
Syntax: in firstobs/lastobs
30/100: observations 30 through 100
Negative numbers count from the end.
L means the last observation.
-10/L: the tenth observation from the end through the last observation
if: select by condition
if selects observations that meet a certain condition.
gender == 1 (male)
math > 50
The if clause is usually placed after the command specification, but before the comma that precedes the list of options.
The basic structure is: command if exp, options
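The in and if qualifiers might be used as follows. This sketch assumes the auto example dataset shipped with Stata; its variables stand in for gender and math from the slide.

```stata
sysuse auto, clear
list make price in 1/5                  // observations 1 through 5
list make price in -3/L                 // the last three observations
summarize price if foreign == 1         // foreign cars only
count if mpg > 25                       // number of cars with mpg above 25
list make mpg if mpg > 30, clean        // if comes before the comma and the options
```

Stata treats missing numeric values as larger than any number, so a condition such as x > 50 also selects missing values unless you add & !missing(x).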
16. Exploring data
codebook    inspect variable values
summarize   summarize distribution
describe    describe the variables
tabulate    tabulate frequencies (tab, tab1, tab2; options row, column)
tabstat     tabulation of statistics
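A short sketch of the exploration commands listed above, assuming the auto example dataset shipped with Stata.

```stata
sysuse auto, clear
describe                                    // variable names, types, labels
codebook rep78                              // values and missing counts for one variable
summarize price mpg                         // mean, sd, min, max
summarize price, detail                     // adds percentiles, skewness, kurtosis
tabulate foreign                            // one-way frequency table
tabulate rep78 foreign, row column          // two-way table with row and column percentages
tabstat price mpg, by(foreign) statistics(mean sd n)
```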
17. Data Management
generate         create variable
egen             extended variable generation
replace          replace values of variable
rename           rename variable
recode           recode variable values
label variable   give variable description
label define     generate value label set
label values     apply value labels to variable
keep             keep variables, drop others
drop             drop variables, keep others
keep if          keep observations, drop others
drop if          drop observations, keep others
sort             sort by variables, ascending
gsort            ascending and descending sort
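Several of these data management commands could be chained as in the sketch below. It assumes the auto example dataset shipped with Stata; the new name repair and the 10,000 cutoff are chosen only for illustration.

```stata
sysuse auto, clear
keep make price mpg rep78 foreign      // keep these variables, drop the rest
rename rep78 repair                    // rename a variable
drop if missing(repair)                // drop observations with a missing repair record
keep if price < 10000                  // keep observations that meet the condition
sort price                             // ascending sort
gsort -mpg make                        // mpg descending, then make ascending
list make price mpg in 1/5
```

keep and drop act on variables when followed by a variable list and on observations when followed by if.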
18. Generate (gen) command
The gen command creates a new variable using an expression that may combine constants, variables, functions, and arithmetic and logical operators.
gen id = _n /* observation number */
gen total = _N /* total number of observations */
gen ten = 10 /* constant value of 10 */
gen tensq = ten^2 /* square of ten */
gen lnten = log(ten) /* natural log of ten */
The egen command creates new variables based on summary measures, such as sum, mean, min and max.
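The slide gives no egen example, so the following is a sketch of how egen might be used next to gen. It assumes the auto example dataset shipped with Stata, and all new variable names are hypothetical.

```stata
sysuse auto, clear
generate price_k = price / 1000                 // gen: row-by-row expression
egen mean_price = mean(price)                   // overall mean, repeated on every row
egen max_mpg = max(mpg)                         // overall maximum
bysort foreign: egen group_mean = mean(price)   // mean within each group
egen total_wt = total(weight)                   // total() is egen's function for sums
list make price mean_price group_mean in 1/5
```

gen evaluates its expression one observation at a time, while egen functions such as mean() summarize across observations.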
19. replace
The typical syntax to replace values of an existing variable is:
replace oldvar = exp [if] [in]
The expressions are similar to those used with the generate command above and can refer to oldvar. Here are two examples:
replace oldvar = oldvar * 5
replace oldvar = oldvar * -1 if oldvar < 0
* recode
This command is useful for dealing with missing values or special codes in existing variables and for changing the values of categorical variables.
recode varlist (rule) [(rule) ...] [, generate(newvar)]
Rule             Example       Meaning
# = #            3 = 1         3 recoded to 1
# # = #          2 . = 9       2 and . recoded to 9
#/# = #          1/5 = 4       1 through 5 recoded to 4
nonmissing = #   nonmiss = 8   all other nonmissing recoded to 8
missing = #      miss = 9      all other missing values recoded to 9
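A runnable sketch of replace and recode, assuming the auto example dataset shipped with Stata. The variables price2, diff and rep_group are hypothetical names created for the example.

```stata
sysuse auto, clear
generate price2 = price
replace price2 = price2 * 5                     // rescale every value
generate diff = mpg - 21
replace diff = diff * -1 if diff < 0            // flip negative values to positive

* Recode rep78: 1 and 2 become 1, 3 is unchanged, 4 and 5 become 5, missing becomes 9
recode rep78 (1 2 = 1) (4/5 = 5) (missing = 9), generate(rep_group)
tabulate rep78 rep_group, missing
```

Using generate() leaves the original variable untouched, which makes it easy to check the recoding with a cross-tabulation.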
20. Labelling Variables and Values
Raw datasets, especially large ones, often contain variable names that are not intuitive. For example, don't be surprised to find a variable named a001s01 or d23s02r34. For this reason, it is important to label variables so that we understand exactly what they mean, while keeping the labels short enough that they are not verbose. To attach a label to a variable, use the label variable command in the following way:
label variable name "name of the head of the household"
* Unlike the values of continuous variables, the values of categorical variables are codes that stand for categories. Value labels record what each code means: first define a label for the values, then apply that value label to a variable, as shown here.
label define sexlbl 0 "Female" 1 "Male"
label values sexhead sexlbl
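The same labelling steps can be tried on a dataset that is available to everyone. This sketch assumes the auto example dataset shipped with Stata; the variable expensive and the label name explbl are hypothetical.

```stata
sysuse auto, clear
generate byte expensive = price > 6000        // hypothetical 0/1 variable for the example
label variable expensive "Price above 6,000 dollars"
label define explbl 0 "No" 1 "Yes"            // define the value label set
label values expensive explbl                 // attach it to the variable
tabulate expensive                            // the table now shows No and Yes
```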
21. * Creating a dummy variable using the gen command
* By-group processing
To execute a Stata command separately for each group of observations that share the same values of the variables in varlist, type:
by varlist: command
Most commands allow the by prefix, but the data must be sorted by varlist (precede the command with sort varlist, or use bysort):
bysort varlist: command
Examples:
bysort id: summarize varname
bysort id: tabulate varname
bysort id: ta varname if varname>=18
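The slide names dummy-variable creation without showing it, so here is one possible sketch, followed by by-group commands. It assumes the auto example dataset shipped with Stata; high_mpg and the cutoffs are hypothetical.

```stata
sysuse auto, clear
* Dummy variable: 1 if mpg is 25 or more, 0 otherwise, missing if mpg is missing
generate high_mpg = (mpg >= 25) if !missing(mpg)
tabulate high_mpg

* By-group processing: run the command separately for each value of foreign
bysort foreign: summarize price
bysort foreign: tabulate rep78 if price >= 5000
```

The if !missing(mpg) qualifier keeps observations with a missing mpg from being coded as 1.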
23. Some other statistical commands.
summarize : descriptive statistics
correlate : correlation matrices
ttest : perform 1-, 2-sample and paired t-tests
anova : 1-, 2-, n-way analysis of variance
regress : least squares regression
predict : generate fitted values, residuals, etc.
test : test linear hypotheses on parameters
logit, logistic : logit model, logistic regression
probit : binomial probit model
tobit : one- and two-limit Tobit model
cnreg : censored normal regression (generalized Tobit)
reg3 : three-stage least squares
lincom : linear combinations of parameters
cnsreg : regression with linear constraints
testnl : test nonlinear hypothesis on parameters
margins : marginal effects (elasticities, etc.)
ivregress : instrumental variables regression
prais : regression with AR(1) errors
sureg : seemingly unrelated regressions
qreg : quantile regression
ologit, oprobit : ordered logit and probit models
mlogit : multinomial logit model
poisson : Poisson regression
heckman : selection model
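A few of the commands in this list could be run in sequence as sketched below. The sketch assumes the auto example dataset shipped with Stata; the models are chosen only to show the syntax, not as a sensible analysis.

```stata
sysuse auto, clear
summarize price mpg weight
correlate price mpg weight
ttest price, by(foreign)                 // two-sample t test
regress price mpg weight                 // least squares regression
predict price_hat                        // fitted values (the default)
predict price_res, residuals             // residuals
test mpg weight                          // joint test that both coefficients are zero
lincom mpg + weight                      // linear combination of coefficients
logit foreign mpg weight                 // logit model for a 0/1 outcome
margins, dydx(*)                         // average marginal effects
```

Post-estimation commands such as predict, test, lincom and margins refer to the most recently fitted model.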
24. Importing a .txt file (fixed-width data)
Very often, important datasets are supplied as text files with a record for each household, individual, or firm. In Stata, one can import fixed-width .txt data using the following command:
infix specifications using <filename>
In this example, the following command imports the text file of PLFS data provided by the NSSO:
infix id 1-3 FSU 4-8 Round 9-10 Schedule 11-13 Sample 14 Sector 15 State 16-18 Dist 19-20 ///
    Stratum 21-22 Sub 23-24 Sub_round 25 Sub_Sample 26 FOD 27-30 HG 31 Second_Stage_Str 32 ///
    Sample_HH_No 33-34 level 35-36 filler 37-41 Informant_sl_no 42-43 response_code 44 ///
    survey_code 45 subst_code 46 using "c:PLFSDataABCD.TXT"
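To try infix without the PLFS file, the sketch below first writes a tiny fixed-width file and then reads it back. The file name demo_fixed.txt, the two records and the column layout are made up for illustration and do not come from the PLFS data.

```stata
* Write a tiny fixed-width file (made-up data) so the example is self-contained
file open fh using "demo_fixed.txt", write text replace
file write fh "0011234501" _n
file write fh "0026789002" _n
file close fh

* Columns 1-3 hold id, columns 4-8 hold fsu, columns 9-10 hold rnd
infix id 1-3 fsu 4-8 rnd 9-10 using "demo_fixed.txt", clear
list
```

Each variable name is followed by the column range it occupies, so no delimiter is needed between fields.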