Introduction to Stata

Faculty of Economics and Political Science, Cairo University
Stata 9
Instructor:
Samaa H. Hosny
Ph.D. Candidate
Sunday-Wednesday
10-14 May 2009

Section 1:
Introduction and Overview

1. Stata interface:
- windows,
- icons vs. syntax, and
- initial output
2. Stata community and website:
www.stata.com

3. Compare Stata with SPSS
- In terms of statistical capabilities, Stata can do a
lot more than SPSS, i.e. it is more advanced.
- SPSS is more inclined towards the business
world.
- Stata is more inclined towards the research
community. It offers a helpful exchange of
ideas and experience among its academic
users.

3. Compare Stata with SPSS (cont'd)
- Stata can receive updates as well as ado files, i.e. you
don't need to wait for a new version to run new
commands.
- Compare the two websites: www.spss.com and
www.stata.com
- Join the mailing list of updates: send a message to
majordomo@hsphsun2.harvard.edu and write:
subscribe statalist email@address
- OR for a daily summary you can write:
subscribe statalist-digest email@address

4. Introducing the three Stata editions:
- Stata/SE: special edition
- Stata/IC: Intercooled
- Stata/small: small (educational)
5. Two dimensions of data: cases and variables

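                                       Stata/SE        Stata/IC        Stata/small
------------------------------------------------------------------------------------
Max. no. of variables                  32,766          2,047           99
Max. no. of observations               2,147,483,647   2,147,483,647   1,000
Max. no. of characters for a string
variable                               244             80              80
Matrices                               1,000 x 1,000   800 x 800       40 x 40
------------------------------------------------------------------------------------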

6. Help: Built-in (offline) / internet (online)
- From Stata: Help >> Search >>Search all
- findit keyword
- Very helpful links at:
http://www.stata.com/links/resources1.html
7. File Types:
- Data file: filename.dta
- Do file: filename.do
- Ado file: filename.ado
- Log file (only readable in Stata): filename.smcl
- Log file (text file): filename.log

8. Main tasks
1. Accessing data
2. Entering data in Stata
3. Convert data (via Stat/Transfer) from formats such
as text, Excel, SPSS, SAS, and other software
4. Save Stata format as a .dta file
5. File and data preparation
6. Descriptive statistics
7. Tabulating data: frequency tables and cross
tabulations
8. Graphics
9. Data analysis
10. Preparing a report using Stata output: (creating a
Word document)

9. Good Practices
1. Documentation within a program (using the *)
2. Intuitive variable names, labels, and file names
3. Avoid destroying or over-writing original data
4. Appropriate use of command abbreviation
5. Keeping a record of work: commands (do) and
outputs (log)
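
The practices above can be combined in one short do-file. The following is only a sketch: the file names analysis.log and auto_work.dta and the variable price_k are hypothetical, and the auto dataset is the example data shipped with Stata, used here as a stand-in for your own file.

```stata
* analysis.do: hypothetical do-file illustrating the practices listed above
capture log close                      // close any log left open by an earlier run
log using analysis.log, text replace   // keep a record of commands and output

sysuse auto, clear                     // example data shipped with Stata

* intuitive variable name and label
generate price_k = price/1000
label variable price_k "Price (thousands of dollars)"
summarize price_k

* save under a NEW name so the original data are never overwritten
save auto_work, replace
log close
```

Running the do-file again reproduces both the new data file and the log, which is the main reason for keeping commands in a do-file rather than typing them interactively.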

Section 2:
Getting Started
(Hands-On Applications)

1. Opening and closing Stata
Like any other software:
Start button>>Programs>>Stata
OR
Double click on the software icon on the
desktop

2. Memory issues: checking and setting memory capacity
- To check memory status (default is 1m = 1 megabyte):
memory
- To change the memory (needed for large data sets):
set memory 750m
set memory 750m, permanently

3. Current directory and changing it
- The current directory is shown in the status bar as a path
below the 4 windows. To use and save files without
retyping the full path in every "use" or "save" command,
we change the directory to the one we want to work in:
cd D:\StataData //no quotes needed since the name has no spaces
OR
cd "D:\StataData Files" //a space in the path name requires quotes

4. Opening and closing files
(data, log, and do)
- Open a file in any directory:
use C:\folder\filename.dta
- Open a file in the current directory:
use filename
- Open specific variables:
use age region using C:\folder\filename.dta
- Open specific cases:
use C:\folder\filename.dta if male==1 //using a certain category
use C:\folder\filename.dta in 1/10 //using the first 10 cases only
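
To try these forms of use without a data file of your own, the sketch below first saves a copy of Stata's shipped auto dataset to a temporary file. The macro name cars is arbitrary, and because it relies on a local macro the block should be run as a whole from a do-file.

```stata
sysuse auto, clear                 // example data shipped with Stata
tempfile cars                      // temporary file name held in local macro cars
save "`cars'"

use make price mpg using "`cars'", clear   // load three variables only
use "`cars'" if foreign == 1, clear        // load one category only
use "`cars'" in 1/10, clear                // load the first 10 cases only
describe, short                            // confirm what is now in memory
```

The clear option is needed because data are already in memory each time use is called.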

5. Preparing a report using Stata output
(creating a Word document)
- Steps:
1. Open a log file to save all contents of the session (commands and
outputs) using:
log using filename.log //to have a text file, not an .smcl file
2. Carry out all the analysis required, then write:
log close
3. Open this file using any .txt reader and copy it to a Word file
4. Format the Word file in "Courier New", size 8 or 9, so the output keeps
exactly the same layout as in the Stata output window
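
A minimal session following steps 1 and 2 might look like this. The log name session1.log is hypothetical and the auto dataset shipped with Stata stands in for your own data.

```stata
capture log close                  // ignore the error if no log is open
log using session1.log, replace    // the .log extension gives a plain-text log
sysuse auto, clear
summarize price mpg
tabulate foreign
log close                          // session1.log can now be opened in any text editor
```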

6. Viewing existing data in a data file
- To view all existing variable names, storage types, labels, and the number of
variables and observations we use:
describe
d, fullnames //to avoid abbreviations in names
- To select some of them by name we use:
describe region1 region2
- To select all those starting with the same letters (e.g. reg) we use:
d reg* // describe all variables starting with reg-
d *tion // describe all variables ending with -tion

6. Viewing existing data in a data file (cont'd)
- To view all existing observations:
list
- To view some selected observations:
list in 1/5 //1st 5 observations
- To view some selected variables:
l age gov sex in 1/10 //1st 10 observations in the three variables
l age gov sex if male==1 //only males in the three variables
li X* //all variables starting with X

6. Viewing existing data in a data file (cont'd)
Important Notes!
- To stop very long output, for
example a listing of all the observations in a very large
dataset, we can use the Break icon in the toolbar
or, from the keyboard, Ctrl+Break on Windows (Ctrl+C in console Stata for Unix)
at any time to stop getting more output from the same command.
- To switch off the -more- pause
between pages of output we type:
set more off
(add ", permanently" to keep this setting in future sessions)

7. Entering and saving data
1. Manually through the keyboard: string variables
should be declared with str before the varname (e.g.
if var3 is a string of 9 characters, it is str9):
input var1 var2 str9 var3
val11 val12 "val13"
.
.
.
valN1 valN2 "valN3"
end
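
A concrete version of the template above, with made-up values; the variable names id, age, and city and the file name people.dta are hypothetical.

```stata
clear
input id age str9 city
1 25 "Cairo"
2 31 "Giza"
3 47 "Aswan"
end
list                     // check what was entered
save people, replace     // writes people.dta in the current directory
```

Quotes around string values are only strictly required when the value contains spaces, but using them consistently avoids surprises.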

7. Entering and saving data (cont'd)
2. Manually through the data editor
- Enter values in the table cell by cell (where the
cursor, i.e. the colored cell, is).
- Double click on the varname and edit its name,
label, and format.

7. Entering and saving data (cont'd)
3. Download or search for datasets by:
- Typing in the command window:
help datasets
- searching www.stata.com for datasets
- searching the internet

7. Entering and saving data (cont'd)
4. Using Stat/Transfer to convert any
spreadsheet into Stata format. This is the
best way to avoid losing any
data and to keep all
variable labels and storage types (in
case the file was in SPSS or any
other statistical package that saves
information about the variables).

7. Entering and saving data (cont'd)
5. Save an Excel file with a variable header (i.e.
varnames in the first row) >> select all >> copy
from the Excel sheet >> highlight the upper left cell in
the Stata data editor >> paste
6. Save an Excel file in tab-delimited format
(.txt) without variable headers (i.e. all columns
are values).
Then type in the Stata command window:
insheet using Book1.txt

7. Entering and saving data (Precautions)
- Check for any data that might have been
lost while transferring to Stata without
Stat/Transfer
- Make sure you label the variables and
rename them in Stata after the insheet
- Also check Stata's infile command
- Note that Stata 10 reads directly from Excel
by using the file icon in the Stata interface.

8. Labeling data, variables, and values
label data "This is Employment data"
label variable employ "Employment Status"
label define employed 0 "unemployed" 1 "employed"
label values employ employed
label define employed 2 "Other", add
OR
label define employed 0 "0: No" 1 "1: Yes" 2 "2: Other", modify
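
The same sequence can be tried on Stata's shipped auto dataset. In this sketch the variable expensive and the value label yesno are hypothetical names created only for illustration.

```stata
sysuse auto, clear
generate byte expensive = price > 6000     // 0/1 indicator (price has no missing values here)
label data "1978 automobile data (labeling example)"
label variable expensive "Price above 6,000"
label define yesno 0 "No" 1 "Yes"          // define the value label
label values expensive yesno               // attach it to the variable
label define yesno 2 "Other", add          // add a category later
label list yesno                           // note: label name, not variable name
tabulate expensive
```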

9. Describing and tabulating data
1. An overview of data
• The first step is to see the data (variables
and observations) using the 'list' and
'describe' commands.
• See the full value labels attached to a variable:
label list name
NB! Here we type the name of the value label,
NOT the varname

9. Describing and tabulating data (cont'd)
2. Summarizing data (descriptive statistics):
- For quantitative data (numeric variables only)
summarize
- To show basic descriptives of var X, i.e. no. of obs., mean,
st. dev., min, and max values
sum X
- To show detailed descriptives of var X: basic + percentiles,
variance, skewness, and kurtosis
sum X, detail
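
For example, using the auto dataset shipped with Stata (chosen here only for illustration):

```stata
sysuse auto, clear
summarize                    // all variables (string variables show 0 observations)
summarize price              // obs, mean, std. dev., min, max
summarize price, detail      // adds percentiles, variance, skewness, kurtosis
display "median price = " r(p50)   // results are left behind in r()
```

The stored results in r() can be reused in later commands, e.g. to center a variable on its mean.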

9. Describing and tabulating data (cont'd)
3. Frequency tables
tabulate X
ta X, nolabel //shows codes NOT labels
tab1 X1 X2 X3 X4 //a separate frequency table for each variable
ta X, summarize(Y) //summarizes Y for each category of X

9. Describing and tabulating data (cont'd)
4. Cross tabulations
- Takes 2 variables: Y on rows, X on columns, with totals:
ta Y X
ta X1 X2, row //displays row percentages (% within each row category)
ta X1 X2, row nofreq //displays row percentages without frequencies
bysort X: tab Y, summarize(Z) missing
//for each category of X (including the missing category),
//we tabulate Y and calculate basic descriptives of Z
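
The frequency-table and cross-tabulation commands can be tried on the auto dataset shipped with Stata; rep78 and foreign are variables in that dataset, picked only as an example.

```stata
sysuse auto, clear
tabulate rep78                        // one-way frequency table
tabulate foreign, nolabel             // codes instead of value labels
tab1 rep78 foreign                    // one table per variable
tabulate foreign, summarize(price)    // mean, sd, and freq. of price by category
tabulate rep78 foreign, row           // two-way table with row percentages
tabulate rep78 foreign, row nofreq    // row percentages only
tabulate rep78 foreign, missing       // treat missing rep78 as its own category
bysort foreign: tabulate rep78, summarize(price) missing
```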

9. Describing and tabulating data (cont'd)
4. Crosstabs (cont'd)
- Another command, table, is more flexible in options, esp. weights.
It can take up to 3 variables, with Y as the rows and X1 as the columns
within each category of X2.
table Y X1 X2, row col //adds a row and a column for totals
table Y X, by(Z) //a separate table of Y on rows and X on columns for each category of Z

10. Data manipulation
1. Creating case number (case id)
generate id=_n
2. Deleting existing variables/cases
drop X //deletes variable X
keep X Y Z //deletes all other variables
drop if gov==1 //deletes all cases in this governorate
drop if ~male //deletes females (male==0); cases with missing male are kept
keep if age>=15 & age<=60 //deletes all other cases
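
A short sketch of these commands on the auto dataset shipped with Stata; the cut-off values are arbitrary.

```stata
sysuse auto, clear
generate id = _n                          // case number
drop trunk                                // delete one variable
keep id make price mpg rep78 foreign      // delete all other variables
drop if foreign == 1                      // delete all foreign cars
keep if price >= 4000 & price <= 10000    // delete cases outside this range
count                                     // number of cases left
```

Since drop and keep cannot be undone, it is safest to run them on a copy of the data and save the result under a new file name.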

3. Dealing with Variable Groups:
- Grouping variables in a variable set
global set1 "X1 X2 X3 X4"
- When we use this variable set in any
command, we call it by adding a $ before
the name. For example:
tab1 $set1
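
For instance, with the auto dataset shipped with Stata (the macro name carvars is hypothetical):

```stata
sysuse auto, clear
global carvars "price mpg weight"
summarize $carvars          // same as: summarize price mpg weight
list $carvars in 1/5
```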

3. Dealing with Variable Groups (cont'd):
- The use of dash (-)
for var X1-Y10: rename X X_2009
//will be executed on variables X1 to Y10 (X stands for each variable in turn)
- The use of star (*): (previously discussed)
describe X*
des demo*99
list *Y*
list *W
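
The for command is an older construct; a foreach loop does the same job and is the form found in current Stata documentation. The sketch below uses the auto dataset shipped with Stata, and the suffix _1978 is arbitrary.

```stata
sysuse auto, clear
describe m*                         // variables starting with m
describe *t                         // variables ending with t

* rename every variable from price through rep78 (price, mpg, rep78)
foreach v of varlist price-rep78 {
    rename `v' `v'_1978
}
describe *_1978
```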

4. Creating new variables (gen)
generate y=1
g z=1 if (x==5)
gen samplesize=_N
//column of a constant = total number of
//observations in the dataset (total sample size)
bysort family: gen famsize=_N
//column of constants = total number of observations in
//each family (these add up to the sample size)

4. Creating new variables (gen) (cont'd)
gen l_income=log(income) //natural log
OR
gen l_income=ln(income) //natural log
gen loginc=log10(income) //base 10 log
gen Y=sqrt(X) //square root of X
gen Z=exp(Y) //exponential of Y
gen sqage=age^2 //age squared
gen XY=X*Y //interaction term
gen lagYt = Yt[_n-1] //value of Yt in the previous observation (data must be sorted by time)
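
These generate patterns can be tried on the auto dataset shipped with Stata. All new variable names below are hypothetical.

```stata
sysuse auto, clear
generate lprice   = ln(price)            // natural log
generate l10price = log10(price)         // base 10 log
generate sqweight = weight^2             // square
generate wtlen    = weight*length        // interaction term
generate heavy    = 1 if weight > 3000   // missing for all other cases
generate n_total  = _N                   // total number of observations
bysort foreign: generate n_group = _N    // number of observations in each group

sort price
generate prev_price = price[_n-1]        // previous observation's price (missing in row 1)
list make price prev_price in 1/5
```

Note the double equals sign in conditions (e.g. if weight == 3000) versus the single equals sign for assignment.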

Creating new variables (egen and its options)
egen avage = mean(age) //mean age of sample (only 1 value)
bysort hh: egen avg = mean(age) //mean age for every hh
egen meddiff = median(var1-var2) //median of the difference (expression: - means subtraction)
egen avginc = rowmean(W X Y Z)
OR
egen avginc = rowmean(W-Z) //(varlist: - means "through")
egen ttlsales = total(sales), by(region)
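
A sketch of the same egen functions on the auto dataset shipped with Stata. The combinations of variables are not meaningful in themselves; they only show the syntax, and all new variable names are hypothetical.

```stata
sysuse auto, clear
egen avprice = mean(price)                       // one value, repeated in every row
bysort foreign: egen avprice_grp = mean(price)   // one value per group
egen meddiff = median(length - turn)             // expression: - is subtraction
egen rowavg  = rowmean(headroom-length)          // varlist: headroom through length
egen ttlweight = total(weight), by(foreign)      // group totals
list make foreign avprice avprice_grp ttlweight in 1/5
```

Whether a dash means subtraction or "through" depends on whether the egen function expects an expression (mean, median, total) or a variable list (rowmean and the other row functions).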

Dummy variable construction:
- Manual (allowing missing values)
gen female=1 if sex==2
replace female=0 if sex==1
- Automatic (NOT allowing missing values)
gen married=(mrtst==2) //dummy for married; cases with missing mrtst get 0
tab region, gen(region)
//generates 6 dummy variables for the 6 regions
- xi commands for categorical data
xi: tab1 i.region
//can be used with any command. Dummies are temporary _I* variables, dropped at the next xi call
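
The difference between the manual and automatic approaches shows up when the source variable has missing values, as rep78 does in the auto dataset shipped with Stata. The variable names highrep, highrep2, and rep are hypothetical.

```stata
sysuse auto, clear

* manual: missing rep78 stays missing in the dummy
generate highrep = 1 if rep78 >= 4 & rep78 < .
replace  highrep = 0 if rep78 < 4

* automatic: a logical expression returns 0 or 1 only, so missing
* rep78 would be coded 1 here because missing counts as a very large number
generate highrep2 = (rep78 >= 4)

tabulate highrep highrep2, missing   // compare the two codings

* one dummy per category: rep1 ... rep5
tabulate rep78, generate(rep)
```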

11. Graphics
- The commands that draw graphs are
command          description
------------------------------------------------------------------------------------------
graph twoway     scatterplots, line plots, etc.
graph matrix     scatterplot matrices
graph bar        bar charts
graph dot        dot charts
graph box        box-and-whisker plots
graph pie        pie charts
other            more commands to draw statistical graphs
------------------------------------------------------------------------------------------

11. Graphics
- The commands that save a previously drawn graph,
redisplay previously saved graphs, and combine graphs are
command          description
-------------------------------------------------------------------------
graph save       save graph to disk
graph use        redisplay graph stored on disk
graph display    redisplay graph stored in memory
graph combine    combine graphs into one
-------------------------------------------------------------------------

11. Graphics
1. Histograms
histogram X // draws a histogram for variable X
histogram X if male==1 in 1/1000
//histogram for variable X for males only, among the first 1000 cases
histogram X, percent normal
// histogram for variable X (in percentages) along with the normal curve
For more info on options: help histogram

11. Graphics
2. Bar graphs
graph bar X
// draws a bar chart with vertical bars for variable X (by default the bar shows the mean of X)
graph hbar Y
// draws a bar chart with horizontal bars for variable Y
For more info on options: help graph bar

11. Graphics
3. Scatterplots
graph twoway scatter Y X
twoway scatter Y X
scatter Y X
The above three commands are equivalent.

11. Graphics
graph twoway (scatter y1 x) (scatter y2 x)
// draws a scatter plot for variable y1 against x and for y2 against x
This is equivalent to typing
twoway scatter y1 x || scatter y2 x
graph twoway (scatter y x) (lfit y x)
// draws a scatter plot for variable y against x and adds the linear prediction fit
graph matrix X1 X2 X3
// scatterplot matrices for the three variables together (two at a time)
For more info on options: help scatter
OR help graph_twoway

11. Graphics
4. Line graphs
graph twoway line Y X
twoway line Y X
line Y X
The above three commands are equivalent.
For more info on options: help line
OR help graph_twoway

11. Graphics
5. Labeling graph and graph axes
scatter lexp region, title("Scatter plot") subtitle("Life expectancy at birth, US") yvarlabel("life expectancy") xvarlabel("Region")
Note: The whole command should be written on
one line (in a do-file it can be continued across lines with ///).
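
An alternative sketch using the ytitle() and xtitle() options, which set the axis titles directly. It assumes the lifeexp dataset shipped with Stata, is meant to be run from a do-file, and uses /// to continue the command over several lines.

```stata
sysuse lifeexp, clear
scatter lexp gnppc, title("Scatter plot")      ///
    subtitle("Life expectancy at birth")       ///
    ytitle("Life expectancy")                  ///
    xtitle("GNP per capita")
```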

11. Graphics
6. Saving graphs
This command is written directly after the graph that
you wish to save:
e.g.
scatter yvar xvar
graph save mygraph //save previous graph
This will create the file mygraph.gph
OR
scatter yvar xvar, saving(mygraph)
graph use mygraph //use saved graph

11. Graphics
7. Combining graphs
e.g. using lifeexp.dta:
scatter lexp region, saving(figure1, replace)
scatter gnppc region, saving(figure2, replace)
graph combine figure1.gph figure2.gph, saving(byregion)
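
If the individual graphs are not needed on disk, they can instead be kept in memory under a name and combined from there. This sketch assumes the lifeexp dataset shipped with Stata; the graph names g1 and g2 are arbitrary.

```stata
sysuse lifeexp, clear
scatter lexp region,  name(g1, replace)   // graph held in memory as g1
scatter gnppc region, name(g2, replace)   // graph held in memory as g2
graph combine g1 g2, title("By region")
graph save byregion, replace              // writes byregion.gph
```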

12. Further topics of interest
- Data should be
weighted to be representative of the
population (help weight)
- Stata can merge files (add variables
from one file to another) and append
files (add cases):
(help merge) and (help append)
- Numerous options are available with
every command

13. Matrices
- To input matrix A:
11 530
550 32130
we do the following:
matrix input A=(11,530\550,32130)
mat list A // to show the matrix content
mat define detA=det(A) // to get the determinant of A
mat define invA=inv(A) // to get the inverse of A
mat define transA=A' // to get the transpose of A
mat D=A+B // to get the sum of A and B (B must have the same dimensions as A)
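
A complete sequence that can be pasted into a do-file. Matrix B is not defined on the slide, so a 2 x 2 identity matrix is assumed here purely for illustration.

```stata
matrix input A = (11, 530 \ 550, 32130)   // backslash separates rows
matrix list A
scalar detA = det(A)                      // determinant stored as a scalar
display detA
matrix invA = inv(A)
matrix transA = A'
matrix B = I(2)                           // assumed: 2 x 2 identity matrix
matrix D = A + B
matrix list D
```

Stata commands are case-sensitive, so matrix (or mat) must be typed in lowercase.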
