Duke SSRI EHDi Workshop Slides,
by Lorrie Schmid, April 2015
This workshop, conducted across three sessions, offers an overview of the SAS programming language, focusing on data management activities. Session 1 is a general overview of SAS with a focus on major SAS components (Program Editor, Log and Output), core concepts of SAS programming (DATA and PROC), and issues of importing/exporting, reading, and writing SAS datasets. Session 2 focuses on data modification, including variable creation and variable recoding, as well as adding to and subsetting from particular datasets. Key SAS statements described include: ATTRIB, SET, WHERE, IF-THEN, MERGE. Session 3 focuses on data analysis, specifically descriptive analyses typically used in data management. Key SAS procedures described include: PROC CONTENTS, PROC MEANS, and PROC FREQ. Together, these sessions allow researchers to learn basic data management processes using the SAS statistical system.
2. Introduction to SAS Procedures
• SAS data set information
– PROC CONTENTS
– PROC PRINT
• Descriptive statistics
– PROC MEANS
– PROC UNIVARIATE
– PROC FREQ
– PROC CORR
3. What does my SAS
Data Set Contain?
• How many observations?
• How many variables?
• What kinds of variables?
4. PROC CONTENTS
• Provides information about the contents of
a SAS data set.
• Example Syntax:
PROC CONTENTS <options>;
TITLE 'Contents Listing';
RUN;
5. PROC CONTENTS - Options
• DATA = <SAS data file>
• OUT = <SAS data file>
• DETAILS / NODETAILS
• ORDER=
• VARNUM
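A minimal sketch of two of these options, assuming the libref sas has already been assigned to the folder that holds the example data set. NOPRINT is a PROC CONTENTS option not listed on the slide, and work.example_vars is a hypothetical data set name chosen for illustration.

```sas
*** VARNUM lists variables in creation order rather than alphabetically ***;
PROC CONTENTS DATA = sas.example VARNUM;
TITLE 'Contents listing in variable order - example data set';
RUN;

*** OUT= saves the variable attributes to a data set ***
*** NOPRINT suppresses the printed listing ***
*** work.example_vars is a hypothetical name ***;
PROC CONTENTS DATA = sas.example OUT = work.example_vars NOPRINT;
RUN;

*** Print a few of the attribute columns that OUT= creates ***;
PROC PRINT DATA = work.example_vars NOOBS;
VAR name type length label;
TITLE 'Variable attributes saved by PROC CONTENTS';
RUN;
```

Saving the attributes with OUT= can be handy when a data set has too many variables to scan by eye.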
6. PROC CONTENTS
• Key items to look for:
• Data set name
• # of observations
• # of variables
• Date data set was created and last modified
• List of variables with type, format and label
7. PROC CONTENTS –
Example Program
*** This shows an example of PROC CONTENTS using the ***
*** example data set ***;
PROC CONTENTS DATA = sas.example;
TITLE 'Contents listing - Example data set';
RUN;
*** A TITLE statement includes the keyword TITLE and quotes ***
*** either single or double - that give the output some ***
*** meaningful title - best practice is to include ***
*** the name of the procedure and the data set being used ***;
8. PROC CONTENTS – Example Output
Data Set Name SAS.EXAMPLE Observations 203
Member Type DATA Variables 17
Engine V9 Indexes 0
Created Wed, Feb 25, 2015 11:32:49 AM Observation Length 104
Last Modified Wed, Feb 25, 2015 11:32:49 AM Deleted Observations 0
Protection Compressed NO
Data Set Type Sorted NO
Label
Data Representation WINDOWS_64
Encoding wlatin1 Western (Windows)
Engine/Host Dependent Information
Data Set Page Size 12288
Number of Data Set Pages 2
First Data Page 1
Max Obs per Page 117
Obs in First Data Page 90
Number of Data Set Repairs 0
Filename C:\Users\schmid\Desktop\Random SAS stuff\example.sas7bdat
Release Created 9.0301M2
Host Created X64_7PRO
9. PROC CONTENTS – Example Output
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat Label
1 ID Num 8 ID
5 age Num 8 age
2 date Num 8 DATE9. DATE9. date
4 gender Num 8 gender
14 livewdad Num 8 livewdad
13 livewmom Num 8 livewmom
12 momcode Char 1 $1. $1. momcode
11 momed Num 8 momed
17 nontrad Char 1 $1. $1. nontrad
3 race Num 8 race
6 sensation_seeking Char 1 $1. $1. sensation seeking
7 senseek1 Num 8 senseek1
8 senseek2 Num 8 senseek2
9 senseek3 Num 8 senseek3
10 senseek4 Num 8 senseek4
15 totfam Char 1 $1. $1. totfam
16 tradfam Char 1 $1. $1. tradfam
10. PROC PRINT
• PROC PRINT -> prints a list of observations
in a SAS data set.
• Example syntax:
PROC PRINT <options>;
WHERE condition;
VAR variable list;
BY variable list;
SUM variable list;
TITLE 'Print';
RUN;
12. PROC PRINT – VAR statement
• Lists the variables to be printed.
• The VAR statement is optional.
• If omitted all the variables in the data set will
be printed.
• Variables are printed in the order listed in the
VAR statement.
13. Example: PROC PRINT syntax
*** This shows an example of PROC PRINT ***
*** using the VAR statement and printing ***
*** only student ID, gender and race ***
*** using the example data set ***;
PROC PRINT DATA = sas.example NOOBS;
VAR id gender race;
TITLE 'Print list - student ID, gender and race - example data set';
RUN;
15. PROC PRINT – BY Statement
• Prints data separately for each group in the
BY variable.
• The BY statement is optional
• When using the BY statement, the data
must first be sorted by the variable(s) listed
in the BY statement.
16. PROC PRINT – BY syntax
*** This is an example of a PROC PRINT using the BY statement ***;
PROC SORT DATA = sas.example;
BY age;
RUN;
PROC PRINT DATA= sas.example;
VAR senseek1--senseek4;
BY age;
TITLE 'PRINT LIST - senseek by age';
RUN;
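The PROC SORT step above sorts the permanent data set in place. One possible variation, sketched here under the assumption that the libref sas is assigned, writes the sorted rows to a temporary copy with the OUT= option instead. The name work.example_by_age is hypothetical.

```sas
*** Sort into a temporary copy so the permanent data set is left untouched ***
*** work.example_by_age is a hypothetical name ***;
PROC SORT DATA = sas.example OUT = work.example_by_age;
BY age;
RUN;

*** The BY statement now reads the sorted copy ***;
PROC PRINT DATA = work.example_by_age;
VAR senseek1--senseek4;
BY age;
TITLE 'PRINT LIST - senseek by age (sorted copy)';
RUN;
```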
18. PROC PRINT – WHERE statement
• WHERE statement can be used to display a
subset of the data set.
• The WHERE statement works in the PROC step
as well as the DATA step
19. PROC PRINT – WHERE syntax
*** This is an example of a PROC PRINT using a WHERE ***
*** statement using the example data set ***;
DATA one;
SET sas.example;
PROC PRINT;
WHERE age = 14;
VAR age senseek1--senseek4;
TITLE 'PRINT - Age 14 - sensation seeking using example data';
RUN;
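Since the WHERE statement also works in the DATA step, the same subset can be created once and reused. This is only a sketch: it assumes the libref sas is assigned, and work.age14 is a hypothetical data set name.

```sas
*** WHERE in a DATA step keeps only the matching rows in the new data set ***
*** work.age14 is a hypothetical name ***;
DATA work.age14;
SET sas.example;
WHERE age = 14;
RUN;

*** No WHERE statement is needed here because the subset is already made ***;
PROC PRINT DATA = work.age14 NOOBS;
VAR id age senseek1--senseek4;
TITLE 'PRINT - Age 14 subset created in a DATA step';
RUN;
```

Subsetting in the PROC step leaves the data unchanged, while subsetting in the DATA step produces a smaller data set for later steps.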
22. PROC MEANS
• Example Syntax:
PROC MEANS <options> <statistic keyword list>;
WHERE condition;
VAR variable list;
CLASS variable list;
BY variable list;
OUTPUT <OUT = SAS dataset>;
RUN;
23. PROC MEANS - Options
• DATA =
• Classification levels control
• Output control
• Output dataset control
• Statistical analysis control
24. PROC MEANS – STATISTIC KEYWORDS:
DEFAULT
• Statistics printed by default
• N – Number of observations
• MEAN – mean
• STD – standard deviation
• MIN – minimum value
• MAX – maximum value
25. PROC MEANS EXAMPLE 1
*** This program is a standard PROC ***
*** MEANS looking at four sensation ***
*** seeking items in the example dataset ***;
PROC MEANS DATA = sas.example;
VAR senseek1--senseek4;
TITLE 'Standard output for means using example
data';
RUN;
26. PROC MEANS EXAMPLE 1 - OUTPUT
Variable    N    Mean   Std Dev   Minimum   Maximum
sseek1    158    3.46      1.37      1.00      5.00
sseek2    158    2.78      1.43      1.00      5.00
sseek3    158    2.62      1.47      1.00      5.00
sseek4    158    3.28      1.39      1.00      5.00
27. PROC MEANS – OTHER STATISTIC
KEYWORDS
• CLM = two-sided confidence limits
• MEDIAN = median
• NMISS = number of missing values
• P10 = 10% quantile
• Q1 = 25% quantile
• RANGE = the range
• STDERR = standard error of the mean
• SUM = sum
• VAR = variance
• T = Student's t
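As a rough illustration of how these keywords are requested, the sketch below lists several of them on the PROC MEANS statement. It assumes the libref sas is assigned. MAXDEC= is an additional PROC MEANS option, not shown on the slides, that limits the number of decimal places printed.

```sas
*** Listing keywords replaces the default statistics ***
*** so N and MEAN are named again here ***;
PROC MEANS DATA = sas.example N NMISS MEAN MEDIAN Q1 P10 CLM MAXDEC = 2;
VAR senseek1--senseek4;
TITLE 'Selected statistics for sensation seeking items - example data set';
RUN;
```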
28. PROC MEANS – EXAMPLE 2
*** This program is a standard PROC MEANS on 2 items and doing a paired t-test ***;
DATA one;
SET sas.example;
PROC MEANS;
VAR senseek1 senseek2;
TITLE 'Means of the 2 sensation seeking items to be used in t test';
RUN;
DATA two;
SET one;
ATTRIB difseek label = 'Differences between senseek1 and senseek2';
difseek = senseek2 - senseek1;
PROC MEANS n mean stderr t prt;
VAR difseek;
TITLE 'T test - differences between senseek2 and senseek1';
TITLE2 'Example data set';
RUN;
29. PROC MEANS –
EXAMPLE 2 OUTPUT
Variable   Label       N        Mean     Std Dev     Minimum     Maximum
senseek1   senseek1  158   3.4620253   1.3760281   1.0000000   5.0000000
senseek2   senseek2  158   2.7848101   1.4336350   1.0000000   5.0000000

Analysis Variable : difseek Differences between senseek1 and senseek2
  N         Mean   Std Error   t Value   Pr > |t|
158   -0.6772152   0.1126051     -6.01     <.0001
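The difference-score approach above is one way to run a paired t test. An alternative sketch, assuming SAS/STAT is licensed and the libref sas is assigned, uses the PAIRED statement of PROC TTEST. This is not the method used in the workshop example. It should test the same mean difference, because PAIRED senseek2*senseek1 analyzes senseek2 minus senseek1.

```sas
*** PAIRED a*b analyzes the difference a minus b ***
*** which matches difseek = senseek2 - senseek1 ***;
PROC TTEST DATA = sas.example;
PAIRED senseek2*senseek1;
TITLE 'Paired t test - senseek2 and senseek1 - example data set';
RUN;
```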
30. PROC MEANS – CLASS Statement
• Class statement: calculates statistics for
each group in CLASS variable.
• CLASS variables can be numeric or
character.
• Data does not need to be sorted to use
CLASS statement.
31. PROC MEANS – CLASS syntax
*** This program includes a CLASS statement. ***
*** The CLASS statement creates separate analyses ***
*** for each category of data specified by the CLASS ***
*** statement. ***;
PROC MEANS DATA = sas.example;
CLASS age;
VAR senseek1--senseek4;
TITLE 'means of sensation seeking items by age';
TITLE2 'example data set';
RUN;
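The OUTPUT statement from the syntax slide can be combined with CLASS to save the group statistics. The sketch below assumes the libref sas is assigned. The data set name work.seek_by_age and the variable names mean_seek1, mean_seek2, n_seek1 and n_seek2 are hypothetical.

```sas
*** NOPRINT suppresses the printed table because the results go to a data set ***
*** All names after OUT = and after the equal signs are hypothetical ***;
PROC MEANS DATA = sas.example NOPRINT;
CLASS age;
VAR senseek1 senseek2;
OUTPUT OUT = work.seek_by_age MEAN = mean_seek1 mean_seek2 N = n_seek1 n_seek2;
RUN;

*** The row with _TYPE_ = 0 holds the overall statistics across all ages ***;
PROC PRINT DATA = work.seek_by_age NOOBS;
TITLE 'Means of senseek1 and senseek2 by age saved to a data set';
RUN;
```

The output data set also carries the automatic variables _TYPE_ and _FREQ_, which identify the grouping level and the number of observations in each group.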
33. PROC UNIVARIATE
• Provides descriptive statistics for numeric
variables (mean, standard deviation, range,
min, max, etc.)
• Provides more detailed information on the
distribution of a variable (extreme values,
plots of distribution, etc.)
34. PROC UNIVARIATE
• Example Syntax:
PROC UNIVARIATE <options>;
WHERE condition;
VAR variable list;
BY variable list;
FREQ variable list;
RUN;
35. PROC UNIVARIATE Syntax
*** This is an example of a standard PROC ***
*** UNIVARIATE program. It uses the variable ***
*** momed (mother education) in the example dataset ***;
PROC UNIVARIATE data = sas.example;
VAR momed;
TITLE "Univariate - mom's education - example dataset";
RUN;
36. PROC UNIVARIATE – Output
Moments
N 155 Sum Weights 155
Mean 3.66451613 Sum Observations 568
Std Deviation 1.15813004 Variance 1.34126519
Skewness -0.7379811 Kurtosis -0.3846022
Uncorrected SS 2288 Corrected SS 206.554839
Coeff Variation 31.6039007 Std Error Mean 0.09302324
Basic Statistical Measures
Location Variability
Mean 3.664516 Std Deviation 1.15813
Median 4.000000 Variance 1.34127
Mode 4.000000 Range 4.00000
Interquartile Range 1.00000
Tests for Location: Mu0=0
Test Statistic p Value
Student's t t 39.39355 Pr > |t| <.0001
Sign M 77.5 Pr >= |M| <.0001
Signed Rank S 6045 Pr >= |S| <.0001
Quantiles (Definition 5)
Quantile Estimate
100% Max 5
99% 5
95% 5
90% 5
75% Q3 4
50% Median 4
25% Q1 3
10% 2
5% 1
1% 1
0% Min 1
37. PROC UNIVARIATE – Output (cont.)
Extreme Observations
Lowest Highest
Value Obs Value Obs
1 199 5 190
1 185 5 191
1 169 5 193
1 137 5 194
1 95 5 196
Missing Values
Missing Value   Count   Percent Of All Obs   Percent Of Missing Obs
.                  48                23.65                   100.00
38. PROC UNIVARIATE - Plots
• Many different visualization options are
available using PROC UNIVARIATE and
coordinating statements
– HISTOGRAM
– PPPLOT
– PROBPLOT
– QQPLOT
– CDFPLOT
39. PROC UNIVARIATE – Plot Syntax
LIBNAME sas "C:\Users\schmid\Desktop\Random SAS stuff";
DATA one;
SET sas.example;
PROC UNIVARIATE PLOT;
VAR momed;
TITLE 'General plots given by univariate procedure for momed - SAS example data';
RUN;
PROC UNIVARIATE PLOT;
VAR momed;
HISTOGRAM;
TITLE 'Histogram given by univariate procedure for momed - SAS example data';
RUN;
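The other plot statements follow the same pattern as HISTOGRAM. A small sketch, assuming the libref sas is assigned, overlays a fitted normal curve on the histogram and adds a Q-Q plot against a normal distribution whose mean and standard deviation are estimated from the data.

```sas
*** NORMAL overlays a fitted normal curve on the histogram ***
*** MU = EST and SIGMA = EST use the sample estimates for the Q-Q reference line ***;
PROC UNIVARIATE DATA = sas.example;
VAR momed;
HISTOGRAM momed / NORMAL;
QQPLOT momed / NORMAL(MU = EST SIGMA = EST);
TITLE 'Histogram and Q-Q plot for momed - example data set';
RUN;
```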
42. PROC FREQ
• Provides descriptive statistics in the form of
frequencies and crosstabulation tables.
• Provides statistics to analyze the relationships
between variables.
43. PROC FREQ
• Example Syntax:
PROC FREQ <options>;
BY variable list;
TABLES variable list </options>;
TEST <options>;
OUTPUT <OUT=DATA>;
RUN;
*If TABLES statement is omitted, one-way tables will be
generated for all variables.
45. PROC FREQ – Basic Table Syntax
*** This is a standard PROC FREQ program. ***
*** The variables used are race and gender ***
*** Refresher information on formats ***
*** Example dataset continues to be used ***;
PROC FORMAT;
VALUE gendfmt 1 = 'Male'
2 = 'Female';
VALUE racefmt 1 = 'AA'
2 = 'White'
3 = 'Hispanic'
4 = 'Multi'
5 = 'Other';
PROC FREQ;
TABLES gender race;
FORMAT gender gendfmt. race racefmt.;
TITLE 'Frequency: gender and race';
TITLE2 'DATA SET: example';
RUN;
46. PROC FREQ – Basic Table Output
gender
gender     Frequency   Percent   Cumulative Frequency   Cumulative Percent
Male              84     50.00                     84                50.00
Female            84     50.00                    168               100.00
Frequency Missing = 35

race
race       Frequency   Percent   Cumulative Frequency   Cumulative Percent
AA                79     47.02                     79                47.02
White             68     40.48                    147                87.50
Hispanic          12      7.14                    159                94.64
Multi              5      2.98                    164                97.62
Other              4      2.38                    168               100.00
Frequency Missing = 35
47. PROC FREQ
• Provides various forms of crosstabulation
tables.
• One-way frequencies -> generates a table with the frequency of
the different values of a variable.
• Two-way crosstabulation table -> generates a frequency table with
the values of the two variables.
• N-way crosstabulation table -> generates an n-way frequency table
with the values of the n variables.
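The three forms differ only in how variables are joined in the TABLES statement. A short sketch, assuming the libref sas is assigned:

```sas
PROC FREQ DATA = sas.example;
*** One-way table of gender ***;
TABLES gender;
*** Two-way table with race as rows and gender as columns ***;
TABLES race*gender;
*** Three-way request that prints one race by gender table for each age ***;
TABLES age*race*gender;
TITLE 'One-way, two-way and three-way tables - example data set';
RUN;
```

In an n-way request the last two variables form the rows and columns of each table, and the earlier variables define the strata.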
48. PROC FREQ – Crosstab Syntax
*** This program is an example of a crosstab ***
*** available as part of the PROC FREQ. ***
*** Variables used are gender and race ***
*** in the example dataset ***;
PROC FREQ;
TABLES race*gender;
FORMAT gender gendfmt. race racefmt.;
TITLE 'Crosstab - gender and race';
RUN;
49. PROC FREQ – Crosstab Output
Table of race by gender
Each cell lists Frequency / Percent / Row Pct / Col Pct; the Total row and column list Frequency / Percent.
race(race)   gender(gender)
             Male                         Female                       Total
AA           36 / 21.43 / 45.57 / 42.86   43 / 25.60 / 54.43 / 51.19   79 / 47.02
White        37 / 22.02 / 54.41 / 44.05   31 / 18.45 / 45.59 / 36.90   68 / 40.48
Hispanic      5 /  2.98 / 41.67 /  5.95    7 /  4.17 / 58.33 /  8.33   12 /  7.14
Multi         3 /  1.79 / 60.00 /  3.57    2 /  1.19 / 40.00 /  2.38    5 /  2.98
Other         3 /  1.79 / 75.00 /  3.57    1 /  0.60 / 25.00 /  1.19    4 /  2.38
Total        84 / 50.00                   84 / 50.00                   168 / 100.00
Frequency Missing = 35
50. PROC FREQ – TABLES Statement
Options
• LIST -> A list rather than a table.
• MISSING -> Missing values are included in
calculations of percentages.
• NOCOL -> Suppresses column percentages.
• NOROW -> Suppresses row percentages.
51. PROC FREQ – TABLES statement
options
• Agree -> Test and measures of classification
agreement.
• CHISQ -> Chi Square test of association
• CL -> Confidence limits
• CMH -> Mantel-Haenszel statistics
• MEASURES -> Association between variables
52. PROC FREQ – TABLES syntax
*** This program is an example of a crosstab using variables race and gender ***
*** Specifically, this shows an example of the options: LIST and MISSING ***;
LIBNAME sas " ";
PROC FORMAT;
VALUE gendfmt 1 = 'Male'
2 = 'Female';
VALUE racefmt 1 = 'AA'
2 = 'White'
3 = 'Hispanic'
4 = 'Multi'
5 = 'Other';
PROC FREQ data = sas.example;
TABLES race*gender/LIST MISSING;
FORMAT gender gendfmt. race racefmt.;
TITLE 'FREQ: Gender and race crosstab - SAS dataset - Example';
RUN;
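The statistical options from the previous slide are requested the same way, after the slash. The sketch below assumes the libref sas is assigned and leaves out the FORMAT statement so that it runs on its own. With sparse categories such as those in this example, SAS may add a warning that the chi-square test may not be valid.

```sas
*** CHISQ requests chi-square tests of association ***
*** NOROW and NOCOL drop the row and column percentages from each cell ***;
PROC FREQ DATA = sas.example;
TABLES race*gender / CHISQ NOROW NOCOL;
TITLE 'FREQ: race by gender with chi-square test - example data set';
RUN;
```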
53. PROC FREQ – TABLES Output
race gender Frequency Percent Cumulative Frequency Cumulative Percent
. . 35 17.24 35 17.24
AA Female 36 17.73 71 34.98
AA Male 43 21.18 114 56.16
White Female 37 18.23 151 74.38
White Male 31 15.27 182 89.66
Hispanic Female 5 2.46 187 92.12
Hispanic Male 7 3.45 194 95.57
Multi Female 3 1.48 197 97.04
Multi Male 2 0.99 199 98.03
Other Female 1 0.49 200 98.52
6 Female 2 0.99 202 99.51
6 Male 1 0.49 203 100.00
54. PROC CORR
• Creates a correlation coefficient that measures
the relationship between two variables.
• Example Syntax:
PROC CORR <options>;
BY <variable list>;
VAR <variable list>;
WITH <variable list>;
RUN;
55. PROC CORR – Basic Syntax
*** This uses the example dataset to conduct a ***
*** PROC CORR. The correlation matrix includes: ***
*** race, gender, age and the four sensation ***
*** seeking items ***;
PROC CORR data = sas.example;
VAR race gender age senseek1--senseek4;
TITLE 'Correlation matrix of variables in example data set';
RUN;
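The WITH statement from the syntax slide is not used above. As a sketch, assuming the libref sas is assigned, it limits the output to correlations between the WITH variables (rows) and the VAR variables (columns) instead of a full square matrix.

```sas
*** Correlate age with each sensation seeking item ***
*** WITH variables form the rows and VAR variables form the columns ***;
PROC CORR DATA = sas.example;
VAR senseek1--senseek4;
WITH age;
TITLE 'Correlations of age with sensation seeking items - example data set';
RUN;
```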
59. PROC CORR - ALPHA
• Only available as part of the Pearson
Correlation statistics
• Internal consistency test for scales of items
that appear to be latent constructs.
• Higher positive values are better.
• How high is good enough? Depends on the
research.
• Missing data can cause error – use NOMISS
option.
60. PROC CORR – Alpha
*** This program will include a correlation matrix ***
*** and Cronbach coefficient alpha to assess internal ***
*** reliability using the example data set ***;
PROC CORR alpha nomiss data = sas.example;
VAR senseek1--senseek4;
TITLE 'Alpha - sensation seeking variables in example data set';
RUN;
61. PROC CORR – Alpha Output
Simple Statistics
Variable N Mean Std Dev Sum Minimum Maximum Label
senseek1 158 3.46203 1.37603 547.00000 1.00000 5.00000 senseek1
senseek2 158 2.78481 1.43363 440.00000 1.00000 5.00000 senseek2
senseek3 158 2.62658 1.46937 415.00000 1.00000 5.00000 senseek3
senseek4 158 3.28481 1.38735 519.00000 1.00000 5.00000 senseek4
Cronbach Coefficient Alpha
Variables Alpha
Raw 0.743971
Standardized 0.745506
63. A Note About Missingness
• Whole courses can and have been taught
about missing data
• What about missing data and analysis?
• Know your data -> and that includes missing
data
• Talk to your team -> standards for handling
missing data
• Applications to correct for missingness
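One simple way to start getting to know the missing data, sketched here with procedures already covered and assuming the libref sas is assigned, is to count missing values for numeric variables and to show missing values as their own category in frequency tables.

```sas
*** NMISS counts missing values for each numeric variable ***;
PROC MEANS DATA = sas.example N NMISS;
VAR age momed senseek1--senseek4;
TITLE 'Non-missing and missing counts - example data set';
RUN;

*** MISSING treats missing values as a category and includes them in percentages ***;
PROC FREQ DATA = sas.example;
TABLES gender race / MISSING;
TITLE 'Frequencies including missing values - example data set';
RUN;
```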
64. Basics of Output Delivery System
(ODS)
• Procedures only produce data.
• Output Delivery System (ODS) determines
where output should go and what it should
look like.
• Many different ways to display output.
• The example that I use the most, and will
describe here, creates RTF-formatted
documents
65. ODS RTF (Rich Text Format)
• ODS RTF is an easy way to create output that can
be directly used in Word reports and PowerPoint
presentations
• Example Syntax:
ODS RTF <options>;
procedures to be run;
ODS RTF CLOSE;
• Key Options
– FILE = Where output is placed: "pathname of file.rtf";
– STYLE = Style definitions; see documentation
66. ODS RTF Syntax
*** This example uses the ODS RTF commands to create ***
*** an RTF output of means. Notice the use of the STYLE = ***
*** option to set it up in an APA-like format ***;
ODS RTF FILE = "C:\Users\schmid\Desktop\Random SAS stuff\means.rtf" STYLE=JOURNAL;
PROC MEANS DATA = sas.example;
VAR senseek1--senseek4;
TITLE 'Standard output for means using example data';
RUN;
ODS RTF CLOSE;
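Other ODS destinations follow the same open, run, close pattern. The sketch below sends frequency tables to a PDF file instead. It assumes the libref sas is assigned. The file name freq_tables.pdf is hypothetical, and without a full path the file is written to the current working directory of the SAS session.

```sas
*** Open the PDF destination - the file name is a hypothetical example ***;
ODS PDF FILE = "freq_tables.pdf" STYLE = JOURNAL;

PROC FREQ DATA = sas.example;
TABLES gender race;
TITLE 'Frequency: gender and race - example data set';
RUN;

*** Closing the destination finishes writing the file ***;
ODS PDF CLOSE;
```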
67. ODS RTF Output
Variable   Label       N        Mean     Std Dev     Minimum     Maximum
senseek1   senseek1  158   3.4620253   1.3760281   1.0000000   5.0000000
senseek2   senseek2  158   2.7848101   1.4336350   1.0000000   5.0000000
senseek3   senseek3  158   2.6265823   1.4693652   1.0000000   5.0000000
senseek4   senseek4  158   3.2848101   1.3873485   1.0000000   5.0000000