The essence of data step programming

The foundation of SAS programming is DATA step programming. The essence of DATA step programming is understanding how SAS processes the data during the compilation and execution phases. In this paper, you will be exposed to what happens “behind the scenes” while creating a SAS dataset. You will learn how a new dataset is created, one observation at a time: data moves from either a raw text file or an existing SAS dataset to the program data vector (PDV), and from the PDV to the newly created SAS dataset. Once you fully understand DATA step processing, the SUM and RETAIN statements become easier to grasp. This paper also covers a related topic, BY-group processing.

Published in: Education, Business, Technology

  1. The Essence of DATA Step Programming. Arthur Li, City of Hope Comprehensive Cancer Center, Department of Information Science.
  2. INTRODUCTION. The fundamental of SAS programming: DATA step programming. The essence of DATA step programming: understanding how SAS processes the data during the compilation and execution phases.
  3. A COMMON BEFUDDLEMENT. The newly created SAS dataset is not what we intended: it has more or fewer observations than expected, or the value of a variable was not retained correctly. Reason: learning only the SAS language syntax without understanding the fundamental SAS programming concepts.
  4. INTRODUCTION. We will cover: what happens “behind the scenes” while creating a SAS dataset; how a new dataset is created, one observation at a time (raw text file or SAS dataset → PDV → SAS dataset); the SUM and RETAIN statements; BY-group processing; examples of transposing datasets.
  5. DATA STEP PROCESSING OVERVIEW. A DATA step is processed in two sequential phases. Compilation phase: each statement is scanned for syntax errors. Execution phase: if there are no syntax errors, the DATA step reads and processes the input data.
  6. DATA STEP PROCESSING OVERVIEW. Program 1:
       data ex1;
         infile 'C:\Arthur\example1.txt';
         input name $ 1-7 height 9-10 weight 12-14;
         BMI = 700*weight/(height*height);
         output;
       run;
     The program uses the column input method: each variable occupies a fixed field (Name in columns 1-7, Height in columns 9-10, Weight in columns 12-14), and the values are standard character or numeric values. The program also creates a new variable, BMI. Example1.txt, shown under a column ruler:
       12345678901234567890
       Barbara 61 12D
       John    62 175
     The weight value 12D in the first data line is a data entry error.
  7. COMPILATION PHASE (Program 1). The input buffer is created. It is used to hold raw data, and it is not created when reading a SAS dataset.
  8. COMPILATION PHASE. The PDV is created. It is the memory area where SAS builds its new dataset, one observation at a time. It starts out with the automatic variables _N_ (D) and _ERROR_ (D).
  9. COMPILATION PHASE. Automatic variable _N_: _N_ = 1 means the 1st observation is being processed; _N_ = 2 means the 2nd observation is being processed.
  10. COMPILATION PHASE. Automatic variable _ERROR_: _ERROR_ = 1 signals a data error in the currently processed observation.
  11. COMPILATION PHASE. A space is added to the PDV for each variable in the INPUT statement: Name (K), Height (K), Weight (K).
  12. COMPILATION PHASE. BMI is added to the PDV. PDV: _N_ (D), _ERROR_ (D), Name (K), Height (K), Weight (K), BMI (K).
  13. COMPILATION PHASE. In the PDV, D = dropped and K = kept.
  14. COMPILATION PHASE. SAS checks for syntax errors: invalid variable names, invalid options, incorrect punctuation, misspelled keywords.
  15. EXECUTION PHASE (Program 1). The DATA step works like a loop: it repeatedly executes its statements, reading data values and creating observations one at a time.
  16. EXECUTION PHASE, 1st iteration. At the beginning, _N_ → 1 and _ERROR_ → 0; the remaining variables are set to missing. PDV: _N_ = 1, _ERROR_ = 0, Name = blank, Height = ., Weight = ., BMI = .
  17. EXECUTION PHASE, 1st iteration. The INFILE statement identifies the location of Example1.txt. The 1st data line (Barbara 61 12D) is read into the input buffer, and the input pointer is placed at the beginning of the input buffer.
  18. EXECUTION PHASE, 1st iteration. The INPUT statement reads data values from the input buffer into the PDV.
  19. EXECUTION PHASE, 1st iteration. Columns 1-7 of the input buffer are read into Name in the PDV. PDV: _N_ = 1, _ERROR_ = 0, Name = Barbara, Height = ., Weight = ., BMI = .
  20. EXECUTION PHASE, 1st iteration. The input pointer now rests at column 8.
  21. EXECUTION PHASE, 1st iteration. Columns 9-10 of the input buffer are read into Height in the PDV. PDV: _N_ = 1, _ERROR_ = 0, Name = Barbara, Height = 61, Weight = ., BMI = .
  22. EXECUTION PHASE, 1st iteration. The input pointer now rests at column 11.
  23. EXECUTION PHASE, 1st iteration. SAS tries to read Weight from columns 12-14, but 12D is an invalid value, so Weight stays missing.
  24. EXECUTION PHASE, 1st iteration. Because the value read for Weight is invalid, _ERROR_ → 1. PDV: _N_ = 1, _ERROR_ = 1, Name = Barbara, Height = 61, Weight = ., BMI = .
  25. EXECUTION PHASE, 1st iteration. The input pointer now rests at column 15.
  26. EXECUTION PHASE, 1st iteration. BMI remains missing, because an operation on a missing value produces a missing value.
  27. EXECUTION PHASE, 1st iteration. The OUTPUT statement is executed. Only the values of variables marked K are copied, as a single observation, to the SAS dataset ex1. Ex1 now holds: obs 1: Name = Barbara, Height = 61, Weight = ., BMI = .
  28. EXECUTION PHASE, 1st iteration. At the end of the DATA step, two things occur automatically (items 1 and 2 on the next two slides).
  29. EXECUTION PHASE. 1. The SAS system returns to the beginning of the DATA step.
  30. EXECUTION PHASE. 2. The values of the variables in the PDV are reset to missing; _N_ is incremented to 2 and _ERROR_ is reset to 0. PDV: _N_ = 2, _ERROR_ = 0, Name = blank, Height = ., Weight = ., BMI = .
  31. EXECUTION PHASE (Program 1), 2nd iteration. The 2nd data line (John 62 175) is read into the input buffer, and the input pointer is placed at the beginning of the input buffer. PDV: _N_ = 2, _ERROR_ = 0, Name = blank, Height = ., Weight = ., BMI = .
  32. EXECUTION PHASE, 2nd iteration. The INPUT statement is executed. PDV: _N_ = 2, _ERROR_ = 0, Name = John, Height = 62, Weight = 175, BMI = .
  33. EXECUTION PHASE, 2nd iteration. BMI is calculated. PDV: _N_ = 2, _ERROR_ = 0, Name = John, Height = 62, Weight = 175, BMI = 31.8678.
  34. EXECUTION PHASE, 2nd iteration. The OUTPUT statement is executed. Ex1 now holds: obs 1: Name = Barbara, Height = 61, Weight = ., BMI = .; obs 2: Name = John, Height = 62, Weight = 175, BMI = 31.8678.
  35. EXECUTION PHASE, 2nd iteration. At the end of the DATA step, two things occur automatically (items 1 and 2 on the next two slides).
  36. EXECUTION PHASE. 1. The SAS system returns to the beginning of the DATA step.
  37. EXECUTION PHASE. 2. The values of the variables in the PDV are reset to missing, and _N_ is incremented to 3. PDV: _N_ = 3, _ERROR_ = 0, Name = blank, Height = ., Weight = ., BMI = .
  38. EXECUTION PHASE. There are no more records to read, so the SAS system goes on to the next DATA or PROC step, here: proc print data=ex1; run;
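
Program 1 depends on an external file, so it cannot be run as shown without that file. The following is a minimal self-contained sketch of the same step, assuming the two data lines from Example1.txt are supplied in-stream with DATALINES instead of through INFILE. The PUT _ALL_ statement is an addition for illustration (it is not part of the original program); it writes the PDV contents, including _N_ and _ERROR_, to the log just before each observation is output.

```sas
/* Sketch: Program 1 with in-stream data instead of the external file. */
data ex1;
    /* Column input: Name in columns 1-7, Height in 9-10, Weight in 12-14 */
    input name $ 1-7 height 9-10 weight 12-14;
    BMI = 700*weight/(height*height);
    /* Added for illustration: write every PDV variable to the log */
    put _all_;
    output;
    datalines;
Barbara 61 12D
John    62 175
;
run;

proc print data=ex1;
run;
```

For the first data line the log should show _ERROR_=1 with Weight and BMI missing, as on slides 24 to 26, and ex1 should end up with the two observations shown on slide 34.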
  39. THE OUTPUT STATEMENT. Program: data ex1; set example1; BMI = 700*weight/(height*height); output; run; The explicit OUTPUT statement tells SAS to write the current observation from the PDV to a SAS dataset immediately, not at the end of the DATA step.
  40. THE OUTPUT STATEMENT. Program: data ex1; set example1; BMI = 700*weight/(height*height); run; The implicit OUTPUT statement tells SAS to write the observation to the dataset at the end of the DATA step.
  41. THE OUTPUT STATEMENT. Using an explicit OUTPUT statement overrides the implicit OUTPUT. We can use more than one OUTPUT statement in a DATA step.
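
A minimal sketch of a DATA step with more than one explicit OUTPUT statement. The dataset names ex1_short and ex1_tall and the height cutoff are hypothetical, chosen only for illustration; the Example1 values are the ones used later in these slides.

```sas
/* Input dataset with the Example1 values used in these slides */
data example1;
    input name $ height weight;
    datalines;
Barbara 61 170
John 62 175
;
run;

/* Two explicit OUTPUT statements route each observation to one of two datasets. */
/* Because an explicit OUTPUT is present, there is no implicit OUTPUT at the end */
/* of the DATA step.                                                             */
data ex1_short ex1_tall;
    set example1;
    BMI = 700*weight/(height*height);
    if height < 62 then output ex1_short;
    else output ex1_tall;
run;
```

Under these assumptions Barbara's observation goes to ex1_short and John's to ex1_tall; an observation for which no OUTPUT statement executes would not be written anywhere.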
  42. THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET. When reading a raw dataset (Program 1): raw data → input buffer → PDV → SAS dataset. Raw data: Barbara 61 12D; John 62 175. Resulting SAS dataset: obs 1: Name = Barbara, Height = 61, Weight = ., BMI = .; obs 2: Name = John, Height = 62, Weight = 175, BMI = 31.8678.
  43. THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET. When reading a SAS dataset: data ex1; set example1; BMI = 700*weight/(height*height); output; run; Input SAS dataset → PDV → output SAS dataset. The input dataset, Example1, is named after SET; the output dataset, Ex1, is named after DATA. Input dataset: obs 1: Name = Barbara, Height = 61, Weight = .; obs 2: Name = John, Height = 62, Weight = 175. Output dataset: obs 1: Name = Barbara, Height = 61, Weight = ., BMI = .; obs 2: Name = John, Height = 62, Weight = 175, BMI = 31.8678.
  44. THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET. When reading a raw dataset, SAS sets each variable value in the PDV to missing at the beginning of each iteration of execution, except for: the automatic variables; variables that are named in a RETAIN or SUM statement; data elements in a _TEMPORARY_ array; variables created in the options of the FILE or INFILE statement.
  45. THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET. When reading a SAS dataset, 1st iteration: at the beginning of the execution phase, SAS sets each variable to missing in the PDV. Example1: obs 1: Name = Barbara, Height = 61, Weight = 170; obs 2: Name = John, Height = 62, Weight = 175. PDV: _N_ = 1, _ERROR_ = 0, Name = blank, Height = ., Weight = ., BMI = .
  46. THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET. When reading a SAS dataset, 1st iteration: the SET statement is executed. PDV: _N_ = 1, _ERROR_ = 0, Name = Barbara, Height = 61, Weight = 170, BMI = .
  47. THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET. When reading a SAS dataset, 1st iteration: BMI is calculated. PDV: _N_ = 1, _ERROR_ = 0, Name = Barbara, Height = 61, Weight = 170, BMI = 31.9807.
  48. THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET. When reading a SAS dataset, 1st iteration: the OUTPUT statement is executed. Ex1: obs 1: Name = Barbara, Height = 61, Weight = 170, BMI = 31.9807.
  49. THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET. When reading a SAS dataset, 2nd iteration, variables that exist in the input dataset: SAS sets each of these variables to missing in the PDV only before the 1st iteration of the execution; they retain their values in the PDV until they are replaced by new values. PDV: _N_ = 2, _ERROR_ = 0, Name = Barbara, Height = 61, Weight = 170, BMI = .
  50. THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET. When reading a SAS dataset, 2nd iteration, variables being created in the DATA step: SAS sets each of these variables to missing in the PDV at the beginning of every iteration of the execution. PDV: _N_ = 2, _ERROR_ = 0, Name = Barbara, Height = 61, Weight = 170, BMI = .
  51. THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET. When reading a SAS dataset, 2nd iteration: the SET statement is executed. PDV: _N_ = 2, _ERROR_ = 0, Name = John, Height = 62, Weight = 175, BMI = .
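
The reset of a newly created variable at the start of every iteration (slide 50) can be seen with a small sketch. It assumes the Example1 values from slide 45; the output dataset name ex1_demo and the condition on _N_ are hypothetical, added only to make the reset visible.

```sas
/* Input dataset with the Example1 values from slide 45 */
data example1;
    input name $ height weight;
    datalines;
Barbara 61 170
John 62 175
;
run;

/* BMI is assigned only during the first iteration. Because BMI is created in  */
/* this DATA step (it does not come from example1), it is reset to missing at  */
/* the start of the second iteration and is never assigned again.              */
data ex1_demo;
    set example1;
    if _n_ = 1 then BMI = 700*weight/(height*height);
run;

proc print data=ex1_demo;
run;
```

Under these assumptions BMI should be filled in for Barbara and missing for John, even though nothing in the step explicitly sets it back to missing.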
  52. THE RETAIN STATEMENT. Consider the following dataset: obs 1: ID = A01, SCORE = 3; obs 2: ID = A02, SCORE = .; obs 3: ID = A03, SCORE = 4. We would like to create a new variable, TOTAL, that accumulates the values of SCORE: TOTAL = 3, 3, 7.
  53. THE RETAIN STATEMENT. How to do it? Set TOTAL to 0 at the first iteration of the execution; then, at each iteration, add the value of SCORE to TOTAL. Problem: TOTAL is a new variable that we want to create, so TOTAL will be set to missing in the PDV at the beginning of every iteration of the execution.
  54. THE RETAIN STATEMENT. To fix this problem, we can use the RETAIN statement: RETAIN VARIABLE <VALUE>; It prevents VARIABLE from being reinitialized each time the DATA step executes.
  55. THE RETAIN STATEMENT. In RETAIN VARIABLE <VALUE>; VARIABLE is the name of the variable that we want to retain. VALUE is a numeric value used to initialize VARIABLE only at the first iteration of the DATA step execution; if no initial value is specified, VARIABLE is initialized as missing.
  56. THE RETAIN STATEMENT. Program: data ex2_2; set ex2; retain total 0; total = sum(total, score); run; The execution phase begins immediately after the completion of the compilation phase. PDV: _N_ (D), _ERROR_ (D), ID (K), Score (K), Total (K).
  57. THE RETAIN STATEMENT, 1st iteration. _N_ → 1, _ERROR_ → 0; ID and SCORE are set to missing; TOTAL → 0 because of the RETAIN statement. PDV: _N_ = 1, _ERROR_ = 0, ID = blank, Score = ., Total = 0.
  58. THE RETAIN STATEMENT, 1st iteration. The 1st observation from ex2 is copied to the PDV. PDV: _N_ = 1, _ERROR_ = 0, ID = A01, Score = 3, Total = 0.
  59. THE RETAIN STATEMENT, 1st iteration. The RETAIN statement is a compile-time-only statement; it does not execute during the execution phase. PDV: _N_ = 1, _ERROR_ = 0, ID = A01, Score = 3, Total = 0.
  60. THE RETAIN STATEMENT, 1st iteration. TOTAL is calculated. PDV: _N_ = 1, _ERROR_ = 0, ID = A01, Score = 3, Total = 3.
  61. THE RETAIN STATEMENT, 1st iteration. The implicit OUTPUT statement tells the SAS system to write the observation to the dataset. Ex2_2: obs 1: ID = A01, SCORE = 3, TOTAL = 3.
  62. THE RETAIN STATEMENT, 2nd iteration. _N_ is incremented to 2. ID and SCORE are retained from the previous iteration because the data are read from an existing SAS dataset; TOTAL is also retained because the RETAIN statement is used. PDV: _N_ = 2, _ERROR_ = 0, ID = A01, Score = 3, Total = 3.
  63. THE RETAIN STATEMENT, 2nd iteration. The 2nd observation from ex2 is copied to the PDV. PDV: _N_ = 2, _ERROR_ = 0, ID = A02, Score = ., Total = 3.
  64. THE RETAIN STATEMENT, 2nd iteration. TOTAL is calculated; the SUM function ignores the missing SCORE, so TOTAL stays at 3. PDV: _N_ = 2, _ERROR_ = 0, ID = A02, Score = ., Total = 3.
  65. THE RETAIN STATEMENT, 2nd iteration. The implicit OUTPUT: the contents of the PDV are written to Ex2_2. Ex2_2: obs 1: ID = A01, SCORE = 3, TOTAL = 3; obs 2: ID = A02, SCORE = ., TOTAL = 3.
  66. THE RETAIN STATEMENT, 3rd iteration. _N_ is incremented to 3. ID and SCORE are retained from the previous iteration; TOTAL is also retained. PDV: _N_ = 3, _ERROR_ = 0, ID = A02, Score = ., Total = 3.
  67. THE RETAIN STATEMENT, 3rd iteration. The 3rd observation from ex2 is copied to the PDV. PDV: _N_ = 3, _ERROR_ = 0, ID = A03, Score = 4, Total = 3.
  68. THE RETAIN STATEMENT, 3rd iteration. TOTAL is calculated. PDV: _N_ = 3, _ERROR_ = 0, ID = A03, Score = 4, Total = 7.
  69. THE RETAIN STATEMENT, 3rd iteration. The implicit OUTPUT: the contents of the PDV are written to Ex2_2. Ex2_2: obs 1: ID = A01, SCORE = 3, TOTAL = 3; obs 2: ID = A02, SCORE = ., TOTAL = 3; obs 3: ID = A03, SCORE = 4, TOTAL = 7.
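
To see what the RETAIN statement changes, the walk-through above can be run next to a version without it. This is a sketch: the in-stream data reproduces the ex2 values from slide 52, and the dataset name ex2_noretain is hypothetical.

```sas
/* The ex2 dataset from slide 52, supplied in-stream */
data ex2;
    input id $ score;
    datalines;
A01 3
A02 .
A03 4
;
run;

/* Without RETAIN: TOTAL is reset to missing at the start of every iteration, */
/* so sum(total, score) simply returns SCORE (missing when SCORE is missing). */
data ex2_noretain;
    set ex2;
    total = sum(total, score);
run;

/* With RETAIN: TOTAL starts at 0 and carries over from one iteration to the next. */
data ex2_2;
    set ex2;
    retain total 0;
    total = sum(total, score);
run;

proc print data=ex2_noretain;
run;

proc print data=ex2_2;
run;
```

The second listing should show the running totals 3, 3, 7 from slide 69, while the first should show no accumulation at all.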
  70. THE SUM STATEMENT. The SUM statement has the following form: VARIABLE + EXPRESSION; VARIABLE is the numeric accumulator variable that is to be created; it is automatically set to 0 at the beginning of the first iteration of the DATA step execution and is retained in the following iterations. EXPRESSION is any SAS expression; if EXPRESSION evaluates to a missing value, it is treated as 0.
  71. THE SUM STATEMENT. The previous program, data ex2_2; set ex2; retain total 0; total = sum(total, score); run; can be rewritten as…
  72. THE SUM STATEMENT. The previous program can be rewritten as: data ex2_2; set ex2; total + score; run;
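
The rule that a missing EXPRESSION is treated as 0 is what separates the SUM statement from an ordinary assignment that uses the + operator. A short sketch, assuming the ex2 values from slide 52; the dataset names ex2_sum and ex2_plus are hypothetical.

```sas
/* The ex2 dataset from slide 52, supplied in-stream */
data ex2;
    input id $ score;
    datalines;
A01 3
A02 .
A03 4
;
run;

/* SUM statement: TOTAL starts at 0, is retained, and a missing SCORE counts as 0 */
data ex2_sum;
    set ex2;
    total + score;
run;

/* Assignment with the + operator: once SCORE is missing, TOTAL becomes missing, */
/* and because TOTAL is retained it stays missing for the remaining observations */
data ex2_plus;
    set ex2;
    retain total 0;
    total = total + score;
run;
```

In ex2_sum TOTAL should be 3, 3, 7; in ex2_plus it should be 3 followed by two missing values.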
  73. THE SUBSETTING IF STATEMENT. We use the subsetting IF statement to continue processing only the observations that meet the condition of the specified expression: IF EXPRESSION; If EXPRESSION is true for the observation, SAS continues to execute the statements in the DATA step and includes the current observation in the dataset.
  74. THE SUBSETTING IF STATEMENT. If EXPRESSION is false for the observation, no further statements are processed for that observation: SAS immediately returns to the beginning of the DATA step, the remaining program statements in the DATA step are not executed, and the current observation is not written to the output dataset.
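
A minimal sketch of a subsetting IF, assuming the Example1 values from slide 45; the dataset name ex1_sub and the weight cutoff are hypothetical.

```sas
/* Input dataset with the Example1 values from slide 45 */
data example1;
    input name $ height weight;
    datalines;
Barbara 61 170
John 62 175
;
run;

data ex1_sub;
    set example1;
    /* Subsetting IF: when false, SAS skips the rest of the step for this */
    /* observation and does not write it to ex1_sub                       */
    if weight > 170;
    /* Reached only for observations that passed the condition above */
    BMI = 700*weight/(height*height);
run;
```

Under these assumptions only John's observation passes the condition, so ex1_sub should contain a single observation.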
  75. THE BY-GROUP PROCESSING IN THE DATA STEP. Data with one observation per subject: obs 1: ID = A01, SCORE = 3; obs 2: ID = A02, SCORE = .; obs 3: ID = A03, SCORE = 4. Longitudinal data, with multiple observations per subject: obs 1: ID = A01, SCORE = 3; obs 2: ID = A01, SCORE = 4; obs 3: ID = A01, SCORE = 2; obs 4: ID = A02, SCORE = 4; obs 5: ID = A02, SCORE = 2. With longitudinal data we need to identify the beginning and end of the measurements for each subject; this can be accomplished by using the BY-group processing method.
  76. THE BY-GROUP PROCESSING IN THE DATA STEP. SAS locates the beginning and end of a BY group by creating two temporary indicator variables for each BY variable: FIRST.VARIABLE and LAST.VARIABLE. Suppose ID is the BY variable. FIRST.ID = 1 when SAS reads the 1st observation for an ID value, and LAST.ID = 1 when SAS reads the last observation for that ID value: obs 1 (A01, 3): FIRST.ID = 1, LAST.ID = 0; obs 2 (A01, 4): FIRST.ID = 0, LAST.ID = 0; obs 3 (A01, 2): FIRST.ID = 0, LAST.ID = 1; obs 4 (A02, 4): FIRST.ID = 1, LAST.ID = 0; obs 5 (A02, 2): FIRST.ID = 0, LAST.ID = 1.
  77. THE BY-GROUP PROCESSING IN THE DATA STEP. Calculating the total score for each subject: proc sort data=ex3; by id; run; data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run; Input (ex3): the five longitudinal observations above. Result (ex3_1): obs 1: ID = A01, TOTAL = 9; obs 2: ID = A02, TOTAL = 6.
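
A self-contained sketch of the program on slide 77, assuming the five ex3 observations shown there are supplied in-stream with DATALINES. The PROC PRINT step is an addition for checking the result.

```sas
/* The longitudinal ex3 data from slide 77, supplied in-stream */
data ex3;
    input id $ score;
    datalines;
A01 3
A01 4
A01 2
A02 4
A02 2
;
run;

/* BY-group processing in a DATA step requires the data to be sorted by the BY variable */
proc sort data=ex3;
    by id;
run;

data ex3_1 (drop=score);
    set ex3;
    by id;
    if first.id = 1 then total = 0;  /* start a new total at the first observation of each ID */
    total + score;                   /* SUM statement: accumulate and retain TOTAL */
    if last.id = 1;                  /* subsetting IF: keep only the last observation of each ID */
run;

proc print data=ex3_1;
run;
```

The listing should match the result on slide 77: one observation per ID, with TOTAL = 9 for A01 and TOTAL = 6 for A02.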
  78. THE BY-GROUP PROCESSING IN THE DATA STEP, 1st iteration. _N_ → 1, _ERROR_ → 0; FIRST.ID → 1 and LAST.ID → 1 (only at the beginning of the 1st iteration); ID and Score are set to missing; TOTAL → 0 because of the SUM statement. PDV: _N_ (D) = 1, _ERROR_ (D) = 0, ID (K) = blank, Score (D) = ., Total (K) = 0, FIRST.ID (D) = 1, LAST.ID (D) = 1.
  79. THE BY-GROUP PROCESSING IN THE DATA STEP, 1st iteration. The SET statement is executed: the 1st observation is copied to the PDV, FIRST.ID → 1 and LAST.ID → 0. PDV: _N_ = 1, _ERROR_ = 0, ID = A01, Score = 3, Total = 0, FIRST.ID = 1, LAST.ID = 0.
  80. THE BY-GROUP PROCESSING IN THE DATA STEP, 1st iteration. Because FIRST.ID = 1, TOTAL → 0. PDV: _N_ = 1, _ERROR_ = 0, ID = A01, Score = 3, Total = 0, FIRST.ID = 1, LAST.ID = 0.
  81. THE BY-GROUP PROCESSING IN THE DATA STEP, 1st iteration. TOTAL is accumulated. PDV: _N_ = 1, _ERROR_ = 0, ID = A01, Score = 3, Total = 3, FIRST.ID = 1, LAST.ID = 0.
  82. THE BY-GROUP PROCESSING IN THE DATA STEP, 1st iteration. The subsetting IF statement evaluates to false because LAST.ID ≠ 1, so SAS returns to the beginning of the DATA step to begin the 2nd iteration. PDV: _N_ = 1, _ERROR_ = 0, ID = A01, Score = 3, Total = 3, FIRST.ID = 1, LAST.ID = 0.
  83. THE BY-GROUP PROCESSING IN THE DATA STEP, 2nd iteration. _N_ is incremented to 2; the values of the rest of the variables are retained. PDV: _N_ = 2, _ERROR_ = 0, ID = A01, Score = 3, Total = 3, FIRST.ID = 1, LAST.ID = 0.
  84. THE BY-GROUP PROCESSING IN THE DATA STEP, 2nd iteration. The 2nd observation is copied to the PDV. It is not the first observation for A01, so FIRST.ID → 0; it is not the last observation for A01, so LAST.ID → 0. PDV: _N_ = 2, _ERROR_ = 0, ID = A01, Score = 4, Total = 3, FIRST.ID = 0, LAST.ID = 0.
  85. THE BY-GROUP PROCESSING IN THE DATA STEP, 2nd iteration. FIRST.ID ≠ 1, so the assignment total = 0 is not executed. PDV: _N_ = 2, _ERROR_ = 0, ID = A01, Score = 4, Total = 3, FIRST.ID = 0, LAST.ID = 0.
  86. THE BY-GROUP PROCESSING IN THE DATA STEP, 2nd iteration. TOTAL is accumulated. PDV: _N_ = 2, _ERROR_ = 0, ID = A01, Score = 4, Total = 7, FIRST.ID = 0, LAST.ID = 0.
  87. 87. THE BY-GROUP PROCESSING IN THE DATA STEP data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0 ; total + score; if last.id = 1 ; run ; PDV <ul><li>The subsetting IF statement is evaluated to be FALSE because LAST.ID ≠ 1 </li></ul><ul><li>SAS returns to the beginning of the DATA step to begin the 3 rd iteration </li></ul>2 nd iteration: 2 A02 5 4 A02 4 2 A01 3 4 A01 2 3 A01 1 SCORE ID _N_ D _ERROR_ D ID K Total K Score D FIRST.ID D LAST.ID D 2 0 0 0 A01 4 7
  88. 88. THE BY-GROUP PROCESSING IN THE DATA STEP data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0 ; total + score; if last.id = 1 ; run ; PDV <ul><li>_N_ ↑ 3 </li></ul><ul><li>The values for the rest of the variables are retained </li></ul>3 rd iteration: 2 A02 5 4 A02 4 2 A01 3 4 A01 2 3 A01 1 SCORE ID _N_ D _ERROR_ D ID K Total K Score D FIRST.ID D LAST.ID D 3 0 0 0 A01 4 7
  89. 89. THE BY-GROUP PROCESSING IN THE DATA STEP data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0 ; total + score; if last.id = 1 ; run ; PDV <ul><li>3 rd observation  PDV </li></ul><ul><li>Not the first observation: FIRST.ID  0 </li></ul><ul><li>Last observation for A01: LAST.ID  1 </li></ul>3 rd iteration: 2 A02 5 4 A02 4 2 A01 3 4 A01 2 3 A01 1 SCORE ID _N_ D _ERROR_ D ID K Total K Score D FIRST.ID D LAST.ID D 3 0 0 1 A01 2 7
  90. 90. THE BY-GROUP PROCESSING IN THE DATA STEP data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0 ; total + score; if last.id = 1 ; run ; PDV <ul><li>FIRST.ID ≠ 1: no execution </li></ul>3 rd iteration: 2 A02 5 4 A02 4 2 A01 3 4 A01 2 3 A01 1 SCORE ID _N_ D _ERROR_ D ID K Total K Score D FIRST.ID D LAST.ID D 3 0 0 1 A01 2 7
  91. 91. THE BY-GROUP PROCESSING IN THE DATA STEP data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0 ; total + score; if last.id = 1 ; run ; PDV <ul><li>TOTAL is calculated </li></ul>3 rd iteration: 2 A02 5 4 A02 4 2 A01 3 4 A01 2 3 A01 1 SCORE ID _N_ D _ERROR_ D ID K Total K Score D FIRST.ID D LAST.ID D 3 0 0 1 A01 2 9
  92. 92. THE BY-GROUP PROCESSING IN THE DATA STEP data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0 ; total + score; if last.id = 1 ; run ; PDV <ul><li>The subsetting IF statement is evaluated to be TRUE </li></ul>3 rd iteration: 2 A02 5 4 A02 4 2 A01 3 4 A01 2 3 A01 1 SCORE ID _N_ D _ERROR_ D ID K Total K Score D FIRST.ID D LAST.ID D 3 0 0 1 A01 2 9
  93. 93. THE BY-GROUP PROCESSING IN THE DATA STEP data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0 ; total + score; if last.id = 1 ; run ; PDV <ul><li>SAS reaches the end of the 3 rd iteration </li></ul><ul><li>The implicit OUTPUT executes </li></ul><ul><li>SAS returns to the beginning of the DATA step to begin the 4 th iteration </li></ul>Ex3_1: 3 rd iteration: 2 A02 5 4 A02 4 2 A01 3 4 A01 2 3 A01 1 SCORE ID 9 A01 1 TOTAL ID _N_ D _ERROR_ D ID K Total K Score D FIRST.ID D LAST.ID D 3 0 0 1 A01 2 9
  94. 94. THE BY-GROUP PROCESSING IN THE DATA STEP data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0 ; total + score; if last.id = 1 ; run ; PDV <ul><li>_N_ ↑ 4 </li></ul><ul><li>The values for the remaining variables are retained </li></ul>Ex3_1: 4 th iteration: 2 A02 5 4 A02 4 2 A01 3 4 A01 2 3 A01 1 SCORE ID 9 A01 1 TOTAL ID _N_ D _ERROR_ D ID K Total K Score D FIRST.ID D LAST.ID D 4 0 0 1 A01 2 9
  95. 95. THE BY-GROUP PROCESSING IN THE DATA STEP data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0 ; total + score; if last.id = 1 ; run ; PDV <ul><li>4 th observation  PDV </li></ul><ul><li>FIRST.ID  1 </li></ul><ul><li>LAST.ID  0 </li></ul>Ex3_1: 4 th iteration: 2 A02 5 4 A02 4 2 A01 3 4 A01 2 3 A01 1 SCORE ID 9 A01 1 TOTAL ID _N_ D _ERROR_ D ID K Total K Score D FIRST.ID D LAST.ID D 4 0 1 0 A02 4 9
  96. 96. THE BY-GROUP PROCESSING IN THE DATA STEP data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0 ; total + score; if last.id = 1 ; run ; PDV <ul><li>FIRST.ID = 1: TOTAL  0 </li></ul>Ex3_1: 4 th iteration: 2 A02 5 4 A02 4 2 A01 3 4 A01 2 3 A01 1 SCORE ID 9 A01 1 TOTAL ID _N_ D _ERROR_ D ID K Total K Score D FIRST.ID D LAST.ID D 4 0 1 0 A02 4 0
  97. 97. THE BY-GROUP PROCESSING IN THE DATA STEP data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0 ; total + score; if last.id = 1 ; run ; PDV <ul><li>TOTAL is calculated </li></ul>Ex3_1: 4 th iteration: 2 A02 5 4 A02 4 2 A01 3 4 A01 2 3 A01 1 SCORE ID 9 A01 1 TOTAL ID _N_ D _ERROR_ D ID K Total K Score D FIRST.ID D LAST.ID D 4 0 1 0 A02 4 4
  98. 98. THE BY-GROUP PROCESSING IN THE DATA STEP data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0 ; total + score; if last.id = 1 ; run ; PDV Ex3_1: <ul><li>The subsetting IF statement is evaluated to be FALSE </li></ul><ul><li>SAS returns to the beginning of the DATA step to begin the 5 th iteration </li></ul>4 th iteration: 2 A02 5 4 A02 4 2 A01 3 4 A01 2 3 A01 1 SCORE ID 9 A01 1 TOTAL ID _N_ D _ERROR_ D ID K Total K Score D FIRST.ID D LAST.ID D 4 0 1 0 A02 4 4
  99. 99. THE BY-GROUP PROCESSING IN THE DATA STEP data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0 ; total + score; if last.id = 1 ; run ; PDV <ul><li>_N_ ↑ 5 </li></ul><ul><li>The values for the remaining variables are retained </li></ul>Ex3_1: 5 th iteration: 2 A02 5 4 A02 4 2 A01 3 4 A01 2 3 A01 1 SCORE ID 9 A01 1 TOTAL ID _N_ D _ERROR_ D ID K Total K Score D FIRST.ID D LAST.ID D 5 0 1 0 A02 4 4
  100. 100. THE BY-GROUP PROCESSING IN THE DATA STEP data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0 ; total + score; if last.id = 1 ; run ; PDV <ul><li>5 th observation  PDV </li></ul><ul><li>FIRST.ID  0 </li></ul><ul><li>LAST.ID  1 </li></ul>Ex3_1: 5 th iteration: 2 A02 5 4 A02 4 2 A01 3 4 A01 2 3 A01 1 SCORE ID 9 A01 1 TOTAL ID _N_ D _ERROR_ D ID K Total K Score D FIRST.ID D LAST.ID D 5 0 0 1 A02 2 4
  101. 101. THE BY-GROUP PROCESSING IN THE DATA STEP data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0 ; total + score; if last.id = 1 ; run ; PDV <ul><li>FIRST.ID ≠ 1: no execution </li></ul>Ex3_1: 5 th iteration: 2 A02 5 4 A02 4 2 A01 3 4 A01 2 3 A01 1 SCORE ID 9 A01 1 TOTAL ID _N_ D _ERROR_ D ID K Total K Score D FIRST.ID D LAST.ID D 5 0 0 1 A02 2 4
  102. 102. THE BY-GROUP PROCESSING IN THE DATA STEP data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0 ; total + score; if last.id = 1 ; run ; PDV <ul><li>TOTAL is calculated </li></ul>Ex3_1: 5 th iteration: 2 A02 5 4 A02 4 2 A01 3 4 A01 2 3 A01 1 SCORE ID 9 A01 1 TOTAL ID _N_ D _ERROR_ D ID K Total K Score D FIRST.ID D LAST.ID D 5 0 0 1 A02 2 6
  103. 103. THE BY-GROUP PROCESSING IN THE DATA STEP data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0 ; total + score; if last.id = 1 ; run ; PDV <ul><li>The subsetting IF statement is evaluated to be TRUE </li></ul>Ex3_1: 5 th iteration: 2 A02 5 4 A02 4 2 A01 3 4 A01 2 3 A01 1 SCORE ID 9 A01 1 TOTAL ID _N_ D _ERROR_ D ID K Total K Score D FIRST.ID D LAST.ID D 5 0 0 1 A02 2 6
  104. 104. THE BY-GROUP PROCESSING IN THE DATA STEP data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0 ; total + score; if last.id = 1 ; run ; PDV Ex3_1: <ul><li>SAS reaches the end of the 5 th iteration </li></ul><ul><li>The implicit OUTPUT executes </li></ul>5 th iteration: 2 A02 5 4 A02 4 2 A01 3 4 A01 2 3 A01 1 SCORE ID 6 A02 2 9 A01 1 TOTAL ID _N_ D _ERROR_ D ID K Total K Score D FIRST.ID D LAST.ID D 5 0 0 1 A02 2 6
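The PDV contents traced in this walkthrough can also be written to the SAS log while the step runs. The sketch below adds PUTLOG statements with the _ALL_ keyword, which lists every variable in the PDV, including _N_, _ERROR_, FIRST.ID, and LAST.ID. The dataset name ex3_trace and the label strings are assumptions added for illustration, and ex3 is assumed to be sorted by ID.

```sas
data ex3_trace (drop=score);
   putlog 'Top of iteration: ' _all_;   /* values carried into this iteration   */
   set ex3;
   by id;
   putlog 'After SET:        ' _all_;   /* observation and BY flags now loaded  */
   if first.id = 1 then total = 0;
   total + score;
   putlog 'After SUM:        ' _all_;   /* TOTAL has been accumulated           */
   if last.id = 1;                      /* subsetting IF: stop here unless true */
   putlog 'Before OUTPUT:    ' _all_;   /* reached only for LAST.ID = 1         */
run;
```

Expect one extra "Top of iteration" line in the log: SAS starts a sixth iteration and stops only when the SET statement finds no more observations to read.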
  105. 105. RESTRUCTURING DATASETS <ul><li>Restructuring datasets: </li></ul>data with one observation per subject (the wide format) data with multiple observations per subject (the long format) 4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID 3 1 3 2 1 TIME 2 A02 5 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID
  106. 106. RESTRUCTURING DATASETS <ul><li>Restructuring datasets: </li></ul>data with one observation per subject (the wide format) data with multiple observations per subject (the long format) S1 – S3 SCORE Distinguish different measurements for each subject 4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID 3 1 3 2 1 TIME 2 A02 5 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID
  107. 107. RESTRUCTURING DATASETS <ul><li>The transformation can be done easily with array processing or PROC TRANSPOSE </li></ul><ul><li>(See my paper “The Many Ways to Effectively Utilize Array Processing”, paper 244-2011) </li></ul><ul><li>For simpler cases, it can also be accomplished without advanced techniques </li></ul><ul><li>Here is a solution that uses multiple OUTPUT statements in one DATA step </li></ul>
  108. 108. FROM WIDE FORMAT TO LONG FORMAT Wide: Long: <ul><li>Transform wide → long </li></ul><ul><li>2 observations to read → 2 DATA step iterations </li></ul><ul><li>Use multiple OUTPUT statements </li></ul><ul><li>Missing values in S1 – S3 are not written to Long </li></ul>data long (drop=s1-s3); set wide; time = 1 ; score = s1; if not missing(score) then output ; time = 2 ; score = s2; if not missing(score) then output ; time = 3 ; score = s3; if not missing(score) then output ; run ; 4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID 3 1 3 2 1 TIME 2 A02 5 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID
  109. 109. FROM WIDE FORMAT TO LONG FORMAT Wide: data long (drop=s1-s3); set wide; time = 1 ; score = s1; if not missing(score) then output ; time = 2 ; score = s2; if not missing(score) then output ; time = 3 ; score = s3; if not missing(score) then output ; run ; 1 st iteration: <ul><li>_N_  1 </li></ul><ul><li>Other variables  missing </li></ul>4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID K ID . D S1 . D S2 . D S3 . K TIME . K SCORE 1 K _N_
  110. 110. FROM WIDE FORMAT TO LONG FORMAT Wide: data long (drop=s1-s3); set wide; time = 1 ; score = s1; if not missing(score) then output ; time = 2 ; score = s2; if not missing(score) then output ; time = 3 ; score = s3; if not missing(score) then output ; run ; 1 st iteration: <ul><li>1 st observation from the wide  PDV </li></ul>4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID A01 K ID 3 D S1 4 D S2 5 D S3 . K TIME . K SCORE 1 K _N_
  111. 111. FROM WIDE FORMAT TO LONG FORMAT Wide: data long (drop=s1-s3); set wide; time = 1 ; score = s1; if not missing(score) then output ; time = 2 ; score = s2; if not missing(score) then output ; time = 3 ; score = s3; if not missing(score) then output ; run ; 1 st iteration: <ul><li>Time  1 </li></ul>4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID A01 K ID 3 D S1 4 D S2 5 D S3 1 K TIME . K SCORE 1 K _N_
  112. 112. FROM WIDE FORMAT TO LONG FORMAT Wide: data long (drop=s1-s3); set wide; time = 1 ; score = s1; if not missing(score) then output ; time = 2 ; score = s2; if not missing(score) then output ; time = 3 ; score = s3; if not missing(score) then output ; run ; 1 st iteration: <ul><li>Score  value from S1(3) </li></ul>4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID A01 K ID 3 D S1 4 D S2 5 D S3 1 K TIME 3 K SCORE 1 K _N_
  113. 113. FROM WIDE FORMAT TO LONG FORMAT Wide: data long (drop=s1-s3); set wide; time = 1 ; score = s1; if not missing(score) then output ; time = 2 ; score = s2; if not missing(score) then output ; time = 3 ; score = s3; if not missing(score) then output ; run ; 1 st iteration: <ul><li>SCORE ≠ missing : ID, TIME, and SCORE  Long </li></ul>Long: 4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID A01 K ID 3 D S1 4 D S2 5 D S3 1 K TIME 3 K SCORE 1 K _N_ 1 TIME 3 A01 1 SCORE ID
  114. 114. FROM WIDE FORMAT TO LONG FORMAT Wide: data long (drop=s1-s3); set wide; time = 1 ; score = s1; if not missing(score) then output ; time = 2 ; score = s2; if not missing(score) then output ; time = 3 ; score = s3; if not missing(score) then output ; run ; 1 st iteration: <ul><li>TIME  2 </li></ul>Long: 4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID A01 K ID 3 D S1 4 D S2 5 D S3 2 K TIME 3 K SCORE 1 K _N_ 1 TIME 3 A01 1 SCORE ID
  115. 115. FROM WIDE FORMAT TO LONG FORMAT Wide: data long (drop=s1-s3); set wide; time = 1 ; score = s1; if not missing(score) then output ; time = 2 ; score = s2; if not missing(score) then output ; time = 3 ; score = s3; if not missing(score) then output ; run ; 1 st iteration: <ul><li>Score  value from S2(4) </li></ul>Long: 4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID A01 K ID 3 D S1 4 D S2 5 D S3 2 K TIME 4 K SCORE 1 K _N_ 1 TIME 3 A01 1 SCORE ID
  116. 116. FROM WIDE FORMAT TO LONG FORMAT Wide: data long (drop=s1-s3); set wide; time = 1 ; score = s1; if not missing(score) then output ; time = 2 ; score = s2; if not missing(score) then output ; time = 3 ; score = s3; if not missing(score) then output ; run ; 1 st iteration: <ul><li>SCORE ≠ missing : ID, TIME, and SCORE  Long </li></ul>Long: 4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID A01 K ID 3 D S1 4 D S2 5 D S3 2 K TIME 4 K SCORE 1 K _N_ 2 1 TIME 4 A01 2 3 A01 1 SCORE ID
  117. 117. FROM WIDE FORMAT TO LONG FORMAT Wide: data long (drop=s1-s3); set wide; time = 1 ; score = s1; if not missing(score) then output ; time = 2 ; score = s2; if not missing(score) then output ; time = 3 ; score = s3; if not missing(score) then output ; run ; 1 st iteration: <ul><li>TIME  3 </li></ul>Long: 4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID A01 K ID 3 D S1 4 D S2 5 D S3 3 K TIME 4 K SCORE 1 K _N_ 2 1 TIME 4 A01 2 3 A01 1 SCORE ID
  118. 118. FROM WIDE FORMAT TO LONG FORMAT Wide: data long (drop=s1-s3); set wide; time = 1 ; score = s1; if not missing(score) then output ; time = 2 ; score = s2; if not missing(score) then output ; time = 3 ; score = s3; if not missing(score) then output ; run ; 1 st iteration: <ul><li>SCORE  value from S3(5) </li></ul>Long: 4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID A01 K ID 3 D S1 4 D S2 5 D S3 3 K TIME 5 K SCORE 1 K _N_ 2 1 TIME 4 A01 2 3 A01 1 SCORE ID
  119. 119. FROM WIDE FORMAT TO LONG FORMAT Wide: data long (drop=s1-s3); set wide; time = 1 ; score = s1; if not missing(score) then output ; time = 2 ; score = s2; if not missing(score) then output ; time = 3 ; score = s3; if not missing(score) then output ; run ; 1 st iteration: <ul><li>SCORE ≠ missing : ID, TIME, and SCORE  Long </li></ul>Long: 4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID A01 K ID 3 D S1 4 D S2 5 D S3 3 K TIME 5 K SCORE 1 K _N_ 3 2 1 TIME 5 A01 3 4 A01 2 3 A01 1 SCORE ID
  120. 120. FROM WIDE FORMAT TO LONG FORMAT Wide: data long (drop=s1-s3); set wide; time = 1 ; score = s1; if not missing(score) then output ; time = 2 ; score = s2; if not missing(score) then output ; time = 3 ; score = s3; if not missing(score) then output ; run ; 1 st iteration: <ul><li>Because the DATA step contains explicit OUTPUT statements, there is no implicit OUTPUT at the end of the iteration </li></ul><ul><li>SAS returns to the beginning of the DATA step to begin the 2 nd iteration </li></ul>Long: 4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID A01 K ID 3 D S1 4 D S2 5 D S3 3 K TIME 5 K SCORE 1 K _N_ 3 2 1 TIME 5 A01 3 4 A01 2 3 A01 1 SCORE ID
  121. 121. FROM WIDE FORMAT TO LONG FORMAT Wide: data long (drop=s1-s3); set wide; time = 1 ; score = s1; if not missing(score) then output ; time = 2 ; score = s2; if not missing(score) then output ; time = 3 ; score = s3; if not missing(score) then output ; run ; 2 nd iteration: <ul><li>_N_ ↑2 </li></ul><ul><li>ID and S1-S3 are retained from the previous iteration </li></ul><ul><li>TIME, SCORE  missing </li></ul>Long: 4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID A01 K ID 3 D S1 4 D S2 5 D S3 . K TIME . K SCORE 2 K _N_ 3 2 1 TIME 5 A01 3 4 A01 2 3 A01 1 SCORE ID
  122. 122. FROM WIDE FORMAT TO LONG FORMAT Wide: data long (drop=s1-s3); set wide; time = 1 ; score = s1; if not missing(score) then output ; time = 2 ; score = s2; if not missing(score) then output ; time = 3 ; score = s3; if not missing(score) then output ; run ; 2 nd iteration: <ul><li>2nd observation from the Wide → PDV </li></ul>Long: 4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID A02 K ID 4 D S1 . D S2 2 D S3 . K TIME . K SCORE 2 K _N_ 3 2 1 TIME 5 A01 3 4 A01 2 3 A01 1 SCORE ID
  123. 123. FROM WIDE FORMAT TO LONG FORMAT Wide: data long (drop=s1-s3); set wide; time = 1 ; score = s1; if not missing(score) then output ; time = 2 ; score = s2; if not missing(score) then output ; time = 3 ; score = s3; if not missing(score) then output ; run ; 2 nd iteration: <ul><li>TIME → 1 </li></ul>Long: 4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID A02 K ID 4 D S1 . D S2 2 D S3 1 K TIME . K SCORE 2 K _N_ 3 2 1 TIME 5 A01 3 4 A01 2 3 A01 1 SCORE ID
  124. 124. FROM WIDE FORMAT TO LONG FORMAT Wide: data long (drop=s1-s3); set wide; time = 1 ; score = s1; if not missing(score) then output ; time = 2 ; score = s2; if not missing(score) then output ; time = 3 ; score = s3; if not missing(score) then output ; run ; 2 nd iteration: <ul><li>SCORE → value from S1 (4) </li></ul>Long: 4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID A02 K ID 4 D S1 . D S2 2 D S3 1 K TIME 4 K SCORE 2 K _N_ 3 2 1 TIME 5 A01 3 4 A01 2 3 A01 1 SCORE ID
  125. 125. FROM WIDE FORMAT TO LONG FORMAT Wide: data long (drop=s1-s3); set wide; time = 1 ; score = s1; if not missing(score) then output ; time = 2 ; score = s2; if not missing(score) then output ; time = 3 ; score = s3; if not missing(score) then output ; run ; 2 nd iteration: <ul><li>ID, TIME, and SCORE → Long </li></ul>Long: 4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID A02 K ID 4 D S1 . D S2 2 D S3 1 K TIME 4 K SCORE 2 K _N_ 1 3 2 1 TIME 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID
  126. 126. FROM WIDE FORMAT TO LONG FORMAT Wide: data long (drop=s1-s3); set wide; time = 1 ; score = s1; if not missing(score) then output ; time = 2 ; score = s2; if not missing(score) then output ; time = 3 ; score = s3; if not missing(score) then output ; run ; 2 nd iteration: <ul><li>TIME → 2 </li></ul>Long: 4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID A02 K ID 4 D S1 . D S2 2 D S3 2 K TIME 4 K SCORE 2 K _N_ 1 3 2 1 TIME 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID
  127. 127. FROM WIDE FORMAT TO LONG FORMAT Wide: data long (drop=s1-s3); set wide; time = 1 ; score = s1; if not missing(score) then output ; time = 2 ; score = s2; if not missing(score) then output ; time = 3 ; score = s3; if not missing(score) then output ; run ; 2 nd iteration: <ul><li>SCORE → the value from S2 ( missing ) </li></ul>Long: 4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID A02 K ID 4 D S1 . D S2 2 D S3 2 K TIME . K SCORE 2 K _N_ 1 3 2 1 TIME 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID
  128. 128. FROM WIDE FORMAT TO LONG FORMAT Wide: data long (drop=s1-s3); set wide; time = 1 ; score = s1; if not missing(score) then output ; time = 2 ; score = s2; if not missing(score) then output ; time = 3 ; score = s3; if not missing(score) then output ; run ; 2 nd iteration: <ul><li>SCORE = missing : no output is generated </li></ul>Long: 4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID A02 K ID 4 D S1 . D S2 2 D S3 2 K TIME . K SCORE 2 K _N_ 1 3 2 1 TIME 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID
  129. 129. FROM WIDE FORMAT TO LONG FORMAT Wide: data long (drop=s1-s3); set wide; time = 1 ; score = s1; if not missing(score) then output ; time = 2 ; score = s2; if not missing(score) then output ; time = 3 ; score = s3; if not missing(score) then output ; run ; 2 nd iteration: <ul><li>TIME → 3 </li></ul>Long: 4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID A02 K ID 4 D S1 . D S2 2 D S3 3 K TIME . K SCORE 2 K _N_ 1 3 2 1 TIME 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID
  130. 130. FROM WIDE FORMAT TO LONG FORMAT Wide: data long (drop=s1-s3); set wide; time = 1 ; score = s1; if not missing(score) then output ; time = 2 ; score = s2; if not missing(score) then output ; time = 3 ; score = s3; if not missing(score) then output ; run ; 2 nd iteration: <ul><li>SCORE → the value from S3 (2) </li></ul>Long: 4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID A02 K ID 4 D S1 . D S2 2 D S3 3 K TIME 2 K SCORE 2 K _N_ 1 3 2 1 TIME 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID
  131. 131. FROM WIDE FORMAT TO LONG FORMAT Wide: data long (drop=s1-s3); set wide; time = 1 ; score = s1; if not missing(score) then output ; time = 2 ; score = s2; if not missing(score) then output ; time = 3 ; score = s3; if not missing(score) then output ; run ; 2 nd iteration: <ul><li>ID, TIME, and SCORE → Long </li></ul>Long: 4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID A02 K ID 4 D S1 . D S2 2 D S3 3 K TIME 2 K SCORE 2 K _N_ 3 1 3 2 1 TIME 2 A02 5 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID
  132. 132. FROM WIDE FORMAT TO LONG FORMAT Wide: data long (drop=s1-s3); set wide; time = 1 ; score = s1; if not missing(score) then output ; time = 2 ; score = s2; if not missing(score) then output ; time = 3 ; score = s3; if not missing(score) then output ; run ; 2 nd iteration: <ul><li>SAS returns to the beginning of the DATA step to begin the 3rd iteration </li></ul><ul><li>With no more observations to read in the 3rd iteration, SAS goes to the next DATA or PROC step </li></ul>Long: 4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID A02 K ID 4 D S1 . D S2 2 D S3 3 K TIME 2 K SCORE 2 K _N_ 3 1 3 2 1 TIME 2 A02 5 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID
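The three repeated assignment-and-OUTPUT groups traced in this walkthrough can be collapsed into a loop over an array. The sketch below is a minimal version of that idea, not the code from the referenced paper: the DATALINES block recreates the two-observation wide dataset, and the names long_arr and s_arr are assumptions.

```sas
/* Recreate the two-observation wide dataset (values taken from the slide) */
data wide;
   input id $ s1 s2 s3;
   datalines;
A01 3 4 5
A02 4 . 2
;
run;

data long_arr (drop=s1-s3);
   set wide;
   array s_arr[3] s1-s3;                   /* group S1-S3 under one array name    */
   do time = 1 to 3;                       /* TIME doubles as the array index     */
      score = s_arr[time];
      if not missing(score) then output;   /* skip missing measurements           */
   end;
run;
```

As in the multiple-OUTPUT version, each DATA step iteration can write up to three observations, and the explicit OUTPUT statement suppresses the implicit one at the end of the step.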
  133. 133. FROM LONG FORMAT TO WIDE FORMAT 4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID 3 1 3 2 1 TIME 2 A02 5 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID
  134. 134. FROM LONG FORMAT TO WIDE FORMAT <ul><li>Reading 5 observations but only creating 2 observations </li></ul><ul><ul><li>You are not copying data from the PDV to the final dataset at each iteration </li></ul></ul><ul><ul><li>You only need to generate one observation once all the observations for each subject have been processed </li></ul></ul>4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID 3 1 3 2 1 TIME 2 A02 5 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID
  135. 135. FROM LONG FORMAT TO WIDE FORMAT if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; <ul><li>Use BY-group processing: BY ID </li></ul><ul><li>Output to the final dataset when LAST.ID = 1 </li></ul><ul><li>SCORE → S1, S2, S3 </li></ul>RETAIN 4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID 3 1 3 2 1 TIME 2 A02 5 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID S3 S1 S3 S2 S1
  136. 136. FROM LONG FORMAT TO WIDE FORMAT proc sort data =long; by id; run ; data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run ; 4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID 3 1 3 2 1 TIME 2 A02 5 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID
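For comparison, the same long-to-wide restructuring can be sketched with PROC TRANSPOSE, which can serve as a cross-check on the DATA step result. This is an illustration under stated assumptions: the DATALINES block recreates the five-observation long dataset, and the output name wide_t is hypothetical.

```sas
/* Recreate the five-observation long dataset (values taken from the slide) */
data long;
   input id $ time score;
   datalines;
A01 1 3
A01 2 4
A01 3 5
A02 1 4
A02 3 2
;
run;

proc sort data=long;
   by id;
run;

/* One output row per ID; the TIME value names the new columns s1, s2, s3 */
proc transpose data=long out=wide_t (drop=_name_) prefix=s;
   by id;
   id time;
   var score;
run;
```

PROC TRANSPOSE builds each subject's row independently, so a subject with no observation for a given TIME (A02 has none for TIME = 2) gets a missing value in that column.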
  137. 137. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run ; <ul><li>1 ST iteration: </li></ul><ul><li>_N_  1 </li></ul><ul><li>FIRST.ID  1, LAST.ID  1 </li></ul><ul><li>Other variables  missing </li></ul>3 1 3 2 1 TIME 2 A02 5 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID . . . . . 1 1 1 K S3 K S2 K S1 D SCORE D TIME K ID D LAST.ID D FIRST.ID D _N_
  138. 138. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run ; <ul><li>1 ST iteration: </li></ul><ul><li>The SET statement copies the 1 st observation  PDV </li></ul>3 1 3 2 1 TIME 2 A02 5 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID . . . 3 1 A01 1 1 1 K S3 K S2 K S1 D SCORE D TIME K ID D LAST.ID D FIRST.ID D _N_
  139. 139. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run ; <ul><li>1 ST iteration: </li></ul><ul><li>The SET statement copies the 1 st observation  PDV </li></ul><ul><li>FIRST.ID  1 since this is the 1 st observation for A01 </li></ul><ul><li>LAST.ID  0 since this is not the last observation for A01 </li></ul>3 1 3 2 1 TIME 2 A02 5 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID . . . 3 1 A01 0 1 1 K S3 K S2 K S1 D SCORE D TIME K ID D LAST.ID D FIRST.ID D _N_
  140. 140. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run ; <ul><li>1 ST iteration: </li></ul><ul><li>Since TIME = 1, S1  SCORE (3) </li></ul>3 1 3 2 1 TIME 2 A02 5 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID . . 3 3 1 A01 0 1 1 K S3 K S2 K S1 D SCORE D TIME K ID D LAST.ID D FIRST.ID D _N_
  141. 141. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run ; <ul><li>1 ST iteration: </li></ul><ul><li>The subsetting IF statement is evaluated to be FALSE </li></ul><ul><li>SAS returns to the beginning of the DATA step to begin the 2 nd iteration </li></ul>3 1 3 2 1 TIME 2 A02 5 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID . . 3 3 1 A01 0 1 1 K S3 K S2 K S1 D SCORE D TIME K ID D LAST.ID D FIRST.ID D _N_
  142. 142. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run ; <ul><li>2 nd iteration: </li></ul><ul><li>_N_ ↑ 2 </li></ul>3 1 3 2 1 TIME 2 A02 5 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID . . 3 3 1 A01 0 1 2 K S3 K S2 K S1 D SCORE D TIME K ID D LAST.ID D FIRST.ID D _N_
  143. 143. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run ; <ul><li>2 nd iteration: </li></ul><ul><li>FIRST.ID and LAST.ID are retained; they are automatic variables </li></ul><ul><li>ID, TIME, SCORE are retained; they are from input dataset </li></ul><ul><li>S1, S2, and S3 are retained because of the RETAIN statement </li></ul>3 1 3 2 1 TIME 2 A02 5 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID . . 3 3 1 A01 0 1 2 K S3 K S2 K S1 D SCORE D TIME K ID D LAST.ID D FIRST.ID D _N_
  144. 144. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run ; <ul><li>2 nd iteration: </li></ul><ul><li>The SET statement copies the 2 nd observation to the PDV </li></ul>3 1 3 2 1 TIME 2 A02 5 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID . . 3 4 2 A01 0 1 2 K S3 K S2 K S1 D SCORE D TIME K ID D LAST.ID D FIRST.ID D _N_
  145. 145. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run ; <ul><li>2 nd iteration: </li></ul><ul><li>The SET statement copies the 2 nd observation to the PDV </li></ul><ul><li>FIRST.ID  0; this is not the first observation for A01 </li></ul><ul><li>LAST.ID  0; this is not the last observation for A01 either </li></ul>3 1 3 2 1 TIME 2 A02 5 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID . . 3 4 2 A01 0 0 2 K S3 K S2 K S1 D SCORE D TIME K ID D LAST.ID D FIRST.ID D _N_
  146. 146. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run ; <ul><li>2 nd iteration: </li></ul><ul><li>Since TIME = 2, S2  SCORE (4) </li></ul>3 1 3 2 1 TIME 2 A02 5 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID . 4 3 4 2 A01 0 0 2 K S3 K S2 K S1 D SCORE D TIME K ID D LAST.ID D FIRST.ID D _N_
  147. 147. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run ; <ul><li>2 nd iteration: </li></ul><ul><li>The subsetting IF statement is evaluated to be FALSE </li></ul><ul><li>SAS returns to the beginning of the DATA step to begin the 3 rd iteration </li></ul>3 1 3 2 1 TIME 2 A02 5 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID . 4 3 4 2 A01 0 0 2 K S3 K S2 K S1 D SCORE D TIME K ID D LAST.ID D FIRST.ID D _N_
  148. 148. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run ; <ul><li>3 rd iteration: </li></ul><ul><li>_N_ ↑ 3 </li></ul><ul><li>The rest of the variables are retained </li></ul>3 1 3 2 1 TIME 2 A02 5 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID . 4 3 4 2 A01 0 0 3 K S3 K S2 K S1 D SCORE D TIME K ID D LAST.ID D FIRST.ID D _N_
  149. 149. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run ; <ul><li>3 rd iteration: </li></ul><ul><li>The SET statement copies the 3 rd observation  PDV </li></ul>3 1 3 2 1 TIME 2 A02 5 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID . 4 3 5 3 A01 0 0 3 K S3 K S2 K S1 D SCORE D TIME K ID D LAST.ID D FIRST.ID D _N_
  150. 150. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run ; <ul><li>3 rd iteration: </li></ul><ul><li>The SET statement copies the 3 rd observation  PDV </li></ul><ul><li>FIRST.ID  0; this is not the first observation for A01 </li></ul><ul><li>LAST.ID  1; this is the last observation for A01 </li></ul>3 1 3 2 1 TIME 2 A02 5 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID . 4 3 5 3 A01 1 0 3 K S3 K S2 K S1 D SCORE D TIME K ID D LAST.ID D FIRST.ID D _N_
  151. 151. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run ; <ul><li>3 rd iteration: </li></ul><ul><li>Since TIME = 3, S3  SCORE (5) </li></ul>3 1 3 2 1 TIME 2 A02 5 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID 5 4 3 5 3 A01 1 0 3 K S3 K S2 K S1 D SCORE D TIME K ID D LAST.ID D FIRST.ID D _N_
  152. 152. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run ; <ul><li>3 rd iteration: </li></ul><ul><li>The subsetting IF statement is evaluated to be true </li></ul>3 1 3 2 1 TIME 2 A02 5 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID 5 4 3 5 3 A01 1 0 3 K S3 K S2 K S1 D SCORE D TIME K ID D LAST.ID D FIRST.ID D _N_ 3 S1 4 S2 5 A01 1 S3 ID
  153. 153. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run; <ul><li>3rd iteration:</li></ul><ul><li>The implicit OUTPUT executes: variables marked with (K) are copied to the dataset wide</li></ul><ul><li>SAS returns to the beginning of the DATA step to begin the 4th iteration</li></ul> Input dataset long (ID, TIME, SCORE): (A01, 1, 3), (A01, 2, 4), (A01, 3, 5), (A02, 1, 4), (A02, 3, 2). PDV (K = kept, D = dropped): _N_ (D) = 3; FIRST.ID (D) = 0; LAST.ID (D) = 1; ID (K) = A01; TIME (D) = 3; SCORE (D) = 5; S1 (K) = 3; S2 (K) = 4; S3 (K) = 5. Output dataset wide (ID, S1, S2, S3): obs 1 = (A01, 3, 4, 5)
  154. 154. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run; <ul><li>4th iteration:</li></ul><ul><li>_N_ ↑ 4</li></ul><ul><li>The rest of the variables are retained</li></ul> Input dataset long (ID, TIME, SCORE): (A01, 1, 3), (A01, 2, 4), (A01, 3, 5), (A02, 1, 4), (A02, 3, 2). PDV (K = kept, D = dropped): _N_ (D) = 4; FIRST.ID (D) = 0; LAST.ID (D) = 1; ID (K) = A01; TIME (D) = 3; SCORE (D) = 5; S1 (K) = 3; S2 (K) = 4; S3 (K) = 5. Output dataset wide (ID, S1, S2, S3): obs 1 = (A01, 3, 4, 5)
  155. 155. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run; <ul><li>4th iteration:</li></ul><ul><li>The SET statement copies the 4th observation to the PDV</li></ul> Input dataset long (ID, TIME, SCORE): (A01, 1, 3), (A01, 2, 4), (A01, 3, 5), (A02, 1, 4), (A02, 3, 2). PDV (K = kept, D = dropped): _N_ (D) = 4; FIRST.ID (D) = 0; LAST.ID (D) = 1; ID (K) = A02; TIME (D) = 1; SCORE (D) = 4; S1 (K) = 3; S2 (K) = 4; S3 (K) = 5. Output dataset wide (ID, S1, S2, S3): obs 1 = (A01, 3, 4, 5)
  156. 156. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run; <ul><li>4th iteration:</li></ul><ul><li>The SET statement copies the 4th observation to the PDV</li></ul><ul><li>FIRST.ID is set to 1; this is the first observation for A02</li></ul><ul><li>LAST.ID is set to 0; this is not the last observation for A02</li></ul> Input dataset long (ID, TIME, SCORE): (A01, 1, 3), (A01, 2, 4), (A01, 3, 5), (A02, 1, 4), (A02, 3, 2). PDV (K = kept, D = dropped): _N_ (D) = 4; FIRST.ID (D) = 1; LAST.ID (D) = 0; ID (K) = A02; TIME (D) = 1; SCORE (D) = 4; S1 (K) = 3; S2 (K) = 4; S3 (K) = 5. Output dataset wide (ID, S1, S2, S3): obs 1 = (A01, 3, 4, 5)
  157. 157. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run; <ul><li>4th iteration:</li></ul><ul><li>Since TIME = 1, S1 is assigned the value of SCORE (4)</li></ul> Input dataset long (ID, TIME, SCORE): (A01, 1, 3), (A01, 2, 4), (A01, 3, 5), (A02, 1, 4), (A02, 3, 2). PDV (K = kept, D = dropped): _N_ (D) = 4; FIRST.ID (D) = 1; LAST.ID (D) = 0; ID (K) = A02; TIME (D) = 1; SCORE (D) = 4; S1 (K) = 4; S2 (K) = 4; S3 (K) = 5. Output dataset wide (ID, S1, S2, S3): obs 1 = (A01, 3, 4, 5)
  158. 158. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run; <ul><li>4th iteration:</li></ul><ul><li>The subsetting IF statement evaluates to false</li></ul><ul><li>SAS returns to the beginning of the DATA step to begin the 5th iteration</li></ul> Input dataset long (ID, TIME, SCORE): (A01, 1, 3), (A01, 2, 4), (A01, 3, 5), (A02, 1, 4), (A02, 3, 2). PDV (K = kept, D = dropped): _N_ (D) = 4; FIRST.ID (D) = 1; LAST.ID (D) = 0; ID (K) = A02; TIME (D) = 1; SCORE (D) = 4; S1 (K) = 4; S2 (K) = 4; S3 (K) = 5. Output dataset wide (ID, S1, S2, S3): obs 1 = (A01, 3, 4, 5)
  159. 159. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run; <ul><li>5th iteration:</li></ul><ul><li>_N_ ↑ 5</li></ul><ul><li>The rest of the variables are retained</li></ul> Input dataset long (ID, TIME, SCORE): (A01, 1, 3), (A01, 2, 4), (A01, 3, 5), (A02, 1, 4), (A02, 3, 2). PDV (K = kept, D = dropped): _N_ (D) = 5; FIRST.ID (D) = 1; LAST.ID (D) = 0; ID (K) = A02; TIME (D) = 1; SCORE (D) = 4; S1 (K) = 4; S2 (K) = 4; S3 (K) = 5. Output dataset wide (ID, S1, S2, S3): obs 1 = (A01, 3, 4, 5)
  160. 160. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run; <ul><li>5th iteration:</li></ul><ul><li>The SET statement copies the 5th observation to the PDV</li></ul> Input dataset long (ID, TIME, SCORE): (A01, 1, 3), (A01, 2, 4), (A01, 3, 5), (A02, 1, 4), (A02, 3, 2). PDV (K = kept, D = dropped): _N_ (D) = 5; FIRST.ID (D) = 1; LAST.ID (D) = 0; ID (K) = A02; TIME (D) = 3; SCORE (D) = 2; S1 (K) = 4; S2 (K) = 4; S3 (K) = 5. Output dataset wide (ID, S1, S2, S3): obs 1 = (A01, 3, 4, 5)
  161. 161. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run; <ul><li>5th iteration:</li></ul><ul><li>The SET statement copies the 5th observation to the PDV</li></ul><ul><li>FIRST.ID is set to 0; this is not the first observation for A02</li></ul><ul><li>LAST.ID is set to 1; this is the last observation for A02</li></ul> Input dataset long (ID, TIME, SCORE): (A01, 1, 3), (A01, 2, 4), (A01, 3, 5), (A02, 1, 4), (A02, 3, 2). PDV (K = kept, D = dropped): _N_ (D) = 5; FIRST.ID (D) = 0; LAST.ID (D) = 1; ID (K) = A02; TIME (D) = 3; SCORE (D) = 2; S1 (K) = 4; S2 (K) = 4; S3 (K) = 5. Output dataset wide (ID, S1, S2, S3): obs 1 = (A01, 3, 4, 5)
  162. 162. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run; <ul><li>5th iteration:</li></ul><ul><li>Since TIME = 3, S3 is assigned the value of SCORE (2)</li></ul> Input dataset long (ID, TIME, SCORE): (A01, 1, 3), (A01, 2, 4), (A01, 3, 5), (A02, 1, 4), (A02, 3, 2). PDV (K = kept, D = dropped): _N_ (D) = 5; FIRST.ID (D) = 0; LAST.ID (D) = 1; ID (K) = A02; TIME (D) = 3; SCORE (D) = 2; S1 (K) = 4; S2 (K) = 4; S3 (K) = 2. Output dataset wide (ID, S1, S2, S3): obs 1 = (A01, 3, 4, 5)
  163. 163. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run; <ul><li>5th iteration:</li></ul><ul><li>The subsetting IF statement evaluates to true</li></ul> Input dataset long (ID, TIME, SCORE): (A01, 1, 3), (A01, 2, 4), (A01, 3, 5), (A02, 1, 4), (A02, 3, 2). PDV (K = kept, D = dropped): _N_ (D) = 5; FIRST.ID (D) = 0; LAST.ID (D) = 1; ID (K) = A02; TIME (D) = 3; SCORE (D) = 2; S1 (K) = 4; S2 (K) = 4; S3 (K) = 2. Output dataset wide (ID, S1, S2, S3): obs 1 = (A01, 3, 4, 5)
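  164. 164. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run; <ul><li>5th iteration:</li></ul><ul><li>The implicit OUTPUT executes</li></ul> Input dataset long (ID, TIME, SCORE): (A01, 1, 3), (A01, 2, 4), (A01, 3, 5), (A02, 1, 4), (A02, 3, 2). PDV (K = kept, D = dropped): _N_ (D) = 5; FIRST.ID (D) = 0; LAST.ID (D) = 1; ID (K) = A02; TIME (D) = 3; SCORE (D) = 2; S1 (K) = 4; S2 (K) = 4; S3 (K) = 2. Output dataset wide (ID, S1, S2, S3): obs 1 = (A01, 3, 4, 5); obs 2 = (A02, 4, 4, 2). S2 for A02 is 4, a value retained from A01, even though A02 has no observation with TIME = 2. How to fix this?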
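The incorrect result above can be reproduced with a short self-contained program. This is only a sketch: the DATALINES step that builds long is an assumption reconstructed from the input table shown on the slides, and the PROC PRINT step is added purely for inspection.

```sas
/* Assumed reconstruction of the input dataset shown on the slides */
data long;
   input id $ time score;
   datalines;
A01 1 3
A01 2 4
A01 3 5
A02 1 4
A02 3 2
;
run;

/* The DATA step from the slides, unchanged in logic */
data wide (drop=time score);
   set long;
   by id;                 /* long is already in ID order */
   retain s1 - s3;        /* S1-S3 keep their values across iterations */
   if time = 1 then s1 = score;
   else if time = 2 then s2 = score;
   else s3 = score;
   if last.id;            /* only the last observation per ID reaches the implicit OUTPUT */
run;

/* Inspect the result: A02 shows S2 = 4, carried over from A01 */
proc print data=wide;
run;
```

Because S1 through S3 are never reset, any score that a subject lacks is silently filled with the previous subject's value.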
  165. 165. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run;
  166. 166. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run; <ul><li>4th iteration:</li></ul><ul><li>_N_ ↑ 4</li></ul><ul><li>The rest of the variables are retained</li></ul> Input dataset long (ID, TIME, SCORE): (A01, 1, 3), (A01, 2, 4), (A01, 3, 5), (A02, 1, 4), (A02, 3, 2). PDV (K = kept, D = dropped): _N_ (D) = 4; FIRST.ID (D) = 0; LAST.ID (D) = 1; ID (K) = A01; TIME (D) = 3; SCORE (D) = 5; S1 (K) = 3; S2 (K) = 4; S3 (K) = 5. Output dataset wide (ID, S1, S2, S3): obs 1 = (A01, 3, 4, 5)
  167. 167. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run; <ul><li>4th iteration:</li></ul><ul><li>The SET statement copies the 4th observation to the PDV</li></ul> Input dataset long (ID, TIME, SCORE): (A01, 1, 3), (A01, 2, 4), (A01, 3, 5), (A02, 1, 4), (A02, 3, 2). PDV (K = kept, D = dropped): _N_ (D) = 4; FIRST.ID (D) = 0; LAST.ID (D) = 1; ID (K) = A02; TIME (D) = 1; SCORE (D) = 4; S1 (K) = 3; S2 (K) = 4; S3 (K) = 5. Output dataset wide (ID, S1, S2, S3): obs 1 = (A01, 3, 4, 5)
  168. 168. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run; <ul><li>4th iteration:</li></ul><ul><li>The SET statement copies the 4th observation to the PDV</li></ul><ul><li>FIRST.ID is set to 1; this is the first observation for A02</li></ul><ul><li>LAST.ID is set to 0; this is not the last observation for A02</li></ul> Input dataset long (ID, TIME, SCORE): (A01, 1, 3), (A01, 2, 4), (A01, 3, 5), (A02, 1, 4), (A02, 3, 2). PDV (K = kept, D = dropped): _N_ (D) = 4; FIRST.ID (D) = 1; LAST.ID (D) = 0; ID (K) = A02; TIME (D) = 1; SCORE (D) = 4; S1 (K) = 3; S2 (K) = 4; S3 (K) = 5. Output dataset wide (ID, S1, S2, S3): obs 1 = (A01, 3, 4, 5)
  169. 169. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run; <ul><li>4th iteration:</li></ul><ul><li>Since FIRST.ID = 1, S1 through S3 are set to missing</li></ul> Input dataset long (ID, TIME, SCORE): (A01, 1, 3), (A01, 2, 4), (A01, 3, 5), (A02, 1, 4), (A02, 3, 2). PDV (K = kept, D = dropped): _N_ (D) = 4; FIRST.ID (D) = 1; LAST.ID (D) = 0; ID (K) = A02; TIME (D) = 1; SCORE (D) = 4; S1 (K) = missing; S2 (K) = missing; S3 (K) = missing. Output dataset wide (ID, S1, S2, S3): obs 1 = (A01, 3, 4, 5)
  170. 170. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run; <ul><li>4th iteration:</li></ul><ul><li>Since TIME = 1, S1 is assigned the value of SCORE (4)</li></ul> Input dataset long (ID, TIME, SCORE): (A01, 1, 3), (A01, 2, 4), (A01, 3, 5), (A02, 1, 4), (A02, 3, 2). PDV (K = kept, D = dropped): _N_ (D) = 4; FIRST.ID (D) = 1; LAST.ID (D) = 0; ID (K) = A02; TIME (D) = 1; SCORE (D) = 4; S1 (K) = 4; S2 (K) = missing; S3 (K) = missing. Output dataset wide (ID, S1, S2, S3): obs 1 = (A01, 3, 4, 5)
  171. 171. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run; <ul><li>4th iteration:</li></ul><ul><li>The subsetting IF statement evaluates to false</li></ul><ul><li>SAS returns to the beginning of the DATA step to begin the 5th iteration</li></ul> Input dataset long (ID, TIME, SCORE): (A01, 1, 3), (A01, 2, 4), (A01, 3, 5), (A02, 1, 4), (A02, 3, 2). PDV (K = kept, D = dropped): _N_ (D) = 4; FIRST.ID (D) = 1; LAST.ID (D) = 0; ID (K) = A02; TIME (D) = 1; SCORE (D) = 4; S1 (K) = 4; S2 (K) = missing; S3 (K) = missing. Output dataset wide (ID, S1, S2, S3): obs 1 = (A01, 3, 4, 5)
  172. 172. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run; <ul><li>5th iteration:</li></ul><ul><li>_N_ ↑ 5</li></ul><ul><li>The rest of the variables are retained</li></ul> Input dataset long (ID, TIME, SCORE): (A01, 1, 3), (A01, 2, 4), (A01, 3, 5), (A02, 1, 4), (A02, 3, 2). PDV (K = kept, D = dropped): _N_ (D) = 5; FIRST.ID (D) = 1; LAST.ID (D) = 0; ID (K) = A02; TIME (D) = 1; SCORE (D) = 4; S1 (K) = 4; S2 (K) = missing; S3 (K) = missing. Output dataset wide (ID, S1, S2, S3): obs 1 = (A01, 3, 4, 5)
  173. 173. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run; <ul><li>5th iteration:</li></ul><ul><li>The SET statement copies the 5th observation to the PDV</li></ul> Input dataset long (ID, TIME, SCORE): (A01, 1, 3), (A01, 2, 4), (A01, 3, 5), (A02, 1, 4), (A02, 3, 2). PDV (K = kept, D = dropped): _N_ (D) = 5; FIRST.ID (D) = 1; LAST.ID (D) = 0; ID (K) = A02; TIME (D) = 3; SCORE (D) = 2; S1 (K) = 4; S2 (K) = missing; S3 (K) = missing. Output dataset wide (ID, S1, S2, S3): obs 1 = (A01, 3, 4, 5)
  174. 174. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run; <ul><li>5th iteration:</li></ul><ul><li>The SET statement copies the 5th observation to the PDV</li></ul><ul><li>FIRST.ID is set to 0; this is not the first observation for A02</li></ul><ul><li>LAST.ID is set to 1; this is the last observation for A02</li></ul> Input dataset long (ID, TIME, SCORE): (A01, 1, 3), (A01, 2, 4), (A01, 3, 5), (A02, 1, 4), (A02, 3, 2). PDV (K = kept, D = dropped): _N_ (D) = 5; FIRST.ID (D) = 0; LAST.ID (D) = 1; ID (K) = A02; TIME (D) = 3; SCORE (D) = 2; S1 (K) = 4; S2 (K) = missing; S3 (K) = missing. Output dataset wide (ID, S1, S2, S3): obs 1 = (A01, 3, 4, 5)
  175. 175. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run; <ul><li>5th iteration:</li></ul><ul><li>Since FIRST.ID ≠ 1, the DO group is not executed</li></ul> Input dataset long (ID, TIME, SCORE): (A01, 1, 3), (A01, 2, 4), (A01, 3, 5), (A02, 1, 4), (A02, 3, 2). PDV (K = kept, D = dropped): _N_ (D) = 5; FIRST.ID (D) = 0; LAST.ID (D) = 1; ID (K) = A02; TIME (D) = 3; SCORE (D) = 2; S1 (K) = 4; S2 (K) = missing; S3 (K) = missing. Output dataset wide (ID, S1, S2, S3): obs 1 = (A01, 3, 4, 5)
  176. 176. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run; <ul><li>5th iteration:</li></ul><ul><li>Since TIME = 3, S3 is assigned the value of SCORE (2)</li></ul> Input dataset long (ID, TIME, SCORE): (A01, 1, 3), (A01, 2, 4), (A01, 3, 5), (A02, 1, 4), (A02, 3, 2). PDV (K = kept, D = dropped): _N_ (D) = 5; FIRST.ID (D) = 0; LAST.ID (D) = 1; ID (K) = A02; TIME (D) = 3; SCORE (D) = 2; S1 (K) = 4; S2 (K) = missing; S3 (K) = 2. Output dataset wide (ID, S1, S2, S3): obs 1 = (A01, 3, 4, 5)
  177. 177. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run; <ul><li>5th iteration:</li></ul><ul><li>The subsetting IF statement evaluates to true</li></ul> Input dataset long (ID, TIME, SCORE): (A01, 1, 3), (A01, 2, 4), (A01, 3, 5), (A02, 1, 4), (A02, 3, 2). PDV (K = kept, D = dropped): _N_ (D) = 5; FIRST.ID (D) = 0; LAST.ID (D) = 1; ID (K) = A02; TIME (D) = 3; SCORE (D) = 2; S1 (K) = 4; S2 (K) = missing; S3 (K) = 2. Output dataset wide (ID, S1, S2, S3): obs 1 = (A01, 3, 4, 5)
  178. 178. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do; s1 = .; s2 = .; s3 = .; end; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run; <ul><li>5th iteration:</li></ul><ul><li>SAS reaches the end of the 5th iteration</li></ul><ul><li>The implicit OUTPUT executes</li></ul> Input dataset long (ID, TIME, SCORE): (A01, 1, 3), (A01, 2, 4), (A01, 3, 5), (A02, 1, 4), (A02, 3, 2). PDV (K = kept, D = dropped): _N_ (D) = 5; FIRST.ID (D) = 0; LAST.ID (D) = 1; ID (K) = A02; TIME (D) = 3; SCORE (D) = 2; S1 (K) = 4; S2 (K) = missing; S3 (K) = 2. Output dataset wide (ID, S1, S2, S3): obs 1 = (A01, 3, 4, 5); obs 2 = (A02, 4, missing, 2)
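The corrected step can also be run end to end as a self-contained program. The sketch below makes two assumptions that are not on the slides: the DATALINES step rebuilds long from the input table shown in the walk-through, and CALL MISSING with an OF variable list is swapped in for the three separate assignments inside the DO group. It is one possible variant, not the author's code.

```sas
/* Assumed reconstruction of the input dataset shown on the slides */
data long;
   input id $ time score;
   datalines;
A01 1 3
A01 2 4
A01 3 5
A02 1 4
A02 3 2
;
run;

data wide (drop=time score);
   set long;
   by id;
   retain s1 - s3;
   /* Reset the retained scores at the start of each BY group.         */
   /* CALL MISSING is a substitution for: s1 = .; s2 = .; s3 = .;      */
   if first.id then call missing(of s1 - s3);
   if time = 1 then s1 = score;
   else if time = 2 then s2 = score;
   else s3 = score;
   if last.id;            /* keep one observation per ID */
run;

/* Inspect the result: A02 now has a missing S2 instead of a carried-over 4 */
proc print data=wide;
run;
```

Resetting on FIRST.ID rather than after the output keeps the reset next to the RETAIN statement it compensates for, which makes the intent easier to see when the step grows.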
  179. 179. CONCLUSION <ul><li>The most important part of DATA step processing is understanding how data is transferred to the PDV and how data is copied from the PDV to a new dataset </li></ul><ul><li>To be a successful SAS programmer, we must thoroughly understand how DATA steps are processed </li></ul>
  180. 180. REFERENCES <ul><li>Cody, Ron. 2001. Longitudinal Data and SAS®: A Programmer’s Guide. Cary, NC: SAS Institute Inc. </li></ul>
  181. 181. ACKNOWLEDGEMENT <ul><li>I would like to thank MaryAnne DePesquo for inviting me to present at SGF 2011 </li></ul>
  182. 182. CONTACT INFORMATION <ul><li>Arthur X. Li </li></ul><ul><li>City of Hope Comprehensive Cancer Center </li></ul><ul><li>Division of Information Science </li></ul><ul><li>1500 East Duarte Road </li></ul><ul><li>Duarte, CA 91010 - 3000 </li></ul><ul><li>Work Phone: (626) 256-4673 ext. 65121 </li></ul><ul><li>Fax: (626) 471-7106 </li></ul><ul><li>E-mail: xueli@coh.org </li></ul>
