Compressing Output Files in a SAS® job, on a UNIX platform
Leslie J. Somos
Great Works Informatics, LLC

© 2008 by Great Works Informatics, LLC
Original situation
[Diagram: on UNIX, SAS writes a file; the file goes by FTP to MS Windows and is then sent on by email.]




We do this for multiple clients.
It is no problem until the file grows, as it does for clients with more data.


As the file gets bigger,
the FTP from UNIX to MS Windows takes longer,
and the email administrators get unhappy.
Original situation
[Diagram: on UNIX, SAS writes a big file; the big file goes by FTP to MS Windows, where it is zipped, and the zipped file is sent on by email.]




We (manually) zip the file so as not to annoy the email administrators.


Everything is fine until we hit a wall:
one particular request for a multi-year extract
runs out of UNIX space while creating the file.


(Two million fixed-length records of 2000 bytes each;
the resulting file would have been ~4 GB,
if there had been enough UNIX space.)


In each case, after the file is created,
we don't do anything further with it on UNIX,
so nothing would be lost if we could somehow get SAS to write out an already-zipped file.
Goal
[Diagram: on UNIX, SAS writes a zipped file directly; the zipped file goes by FTP to MS Windows and is then sent on by email.]




This picture shows what we want to happen: SAS writes an already-compressed file,
which takes up less disk space and also takes less time to FTP
from UNIX over to MS Windows.


(The ~4 GB of data produces a zipped file of ~74 MB.
FTP from UNIX to MS Windows: ~22 minutes.)
Items we will touch on
    • 'pipe' access method of FILENAME statement
    • 'nobs=' and 'point=' options of SET statement
    • compilation phase v. execution phase of DATA Step
Coding Conventions
    • SAS keywords in lowercase
    • User-chosen words in uppercase
    • Spaces supplied even where not required by the syntax
    • Optional period after a macro variable always supplied
Original code

            filename OUT
              'BIG.TXT'
                  lrecl=32760 ;   /* ordinary file */

            data _null_ ;
              set OURDATA ;
              file OUT ;
              put <field>…<field> ;
            run ;




Modify the "filename" statement to use an 'unnamed pipe' to write the data to the
"compress" program, which then writes a compressed version to disk.
The full-size file only ever exists on-the-fly, as it flows through the OUT fileref.
Only the compressed version of the file ever exists on disk, so peak disk space
usage is less.
Modified code

            filename OUT pipe
              'compress > NOTSOBIG.Z'
                  lrecl=32760 ;   /* "unnamed pipe" */

            data _null_ ;
               set OURDATA ;
               file OUT ;
               put <field>…<field> ;
            run ;

Program compress reads its STDIN, receiving the data written to fileref OUT,
and writes to its STDOUT, which is redirected to NOTSOBIG.Z.




Original code :: Modified code
Original filename statement:

           filename OUT
             'BIG.TXT'
               lrecl=32760 ;   /* ordinary file */

Modified filename statement:

           filename OUT pipe
             'compress > NOTSOBIG.Z'
               lrecl=32760 ;   /* "unnamed pipe" */

Data step (identical in both versions):

           data _null_ ;
             set OURDATA ;
             file OUT ;
             put <field>…<field> ;
           run ;

No changes to data step, only to filename.
When you uncompress, supply name BIG.TXT.




A simple modification: No changes were necessary to the data step, only to the
filename statement.


Since the "compress" command only received a bytestream, it couldn't and didn't
record any file name within the compressed file.
When the file is subsequently uncompressed, the program will prompt for a file
name to be supplied.
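
The pipe access method also works in the input direction, so the compressed file can be checked without ever writing the full-size file to disk. The following is a minimal sketch, not part of the original job: the fileref CHECK and the variables LASTREC and RECCOUNT are names chosen here for illustration, and it assumes the standard UNIX "uncompress" command is available (its "-c" option writes the uncompressed data to standard output).

```sas
/* Hypothetical check step: read NOTSOBIG.Z back through an input pipe */
/* and count the records, without creating BIG.TXT on disk.            */
filename CHECK pipe
  'uncompress -c NOTSOBIG.Z'
      lrecl=32760 ;

data _null_ ;
  infile CHECK end=LASTREC ;   /* LASTREC becomes 1 on the last record */
  input ;                      /* read one record into _infile_        */
  RECCOUNT + 1 ;               /* sum statement: running record count  */
  if LASTREC then
    put 'Records read back: ' RECCOUNT ;
run ;
```

The count written to the SAS log can be compared with the number of observations in the source data set. To recreate the full-size file on a machine with enough space, the same command can be run from the shell with its output redirected, for example "uncompress -c NOTSOBIG.Z > BIG.TXT".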
Original code
             filename OUT
                'BIG.TXT'
                     lrecl=32760 ;




             data _null_ ;
               set ONEDATA ;
               file OUT ;
               put <field>…<field> ;
             run ;

             data _null_ ;
               set TWODATA ;
               file OUT mod ;
               put <field>…<field> ;
             run ;








Slightly more complex code, closer to the actual original code: multiple data steps
write to the same output file.
(The original code had six data steps; two are sufficient to demonstrate the issues we
will discuss.)


Each data step after the first one appends to the output file by using the "mod"
option of the "file" statement.
Original code :: Modified code
Original filename statement:

             filename OUT
                'BIG.TXT'
                     lrecl=32760 ;

Modified filename statement:

             filename OUT pipe
                'compress > NOTSOBIG.Z'
                     lrecl=32760 ;

Data steps (identical in both versions):

             data _null_ ;
               set ONEDATA ;
               file OUT ;
               put <field>…<field> ;
             run ;

             data _null_ ;
               set TWODATA ;
               file OUT mod ;
               put <field>…<field> ;
             run ;

No changes to data steps: problem!
Program compress gets called a 2nd time, and overwrites NOTSOBIG.Z.
Only TWODATA is in the resulting file.




If we follow the pattern we know, we modify just the "filename" statement.
Each data step starts the "compress" program anew, and overwrites the previous
content of the output file NOTSOBIG.Z.


The "mod" option of the "file" statement has no effect here, when it refers to an
unnamed pipe.
Original code :: Modified code
Original code: the two data steps from the earlier "Original code" slide, the second one using "file OUT mod".

Modified code:

            filename OUT pipe
               'compress > NOTSOBIG.Z'
                    lrecl=32760 ;

            data _null_ ;
              file OUT ;

              set ONEDATA nobs=CONSTNOBS1 ;
              do POINTVAR = 1 to CONSTNOBS1 ;
                set ONEDATA point=POINTVAR ;
                put <field>…<field> ;
              end ;

              set TWODATA nobs=CONSTNOBS2 ;
              do POINTVAR = 1 to CONSTNOBS2 ;
                set TWODATA point=POINTVAR ;
                put <field>…<field> ;
              end ;

              stop ; /*<=REMEMBER!*/
            run ;

Only one data step, one call to compress.
Problem if ONEDATA has zero observations.




Solution attempt: Combine the multiple data steps into one single data step, so
there is only one invocation of the "compress" program.
[When you take over from the normal SAS data cycle, you also have to include logic
to stop execution.]


Note -- the single variable "POINTVAR" is used in multiple loops with no problem.
The two different "nobs=" variables in the two separate "set" statements must be
different from each other, else one interferes with the other at compile time.


Works in most cases, but:
If any data set before the last has zero observations, execution stops when its "set"
statement is executed, and the output file is incomplete.
Original code :: Modified code
Original code: the two data steps from the earlier "Original code" slide, the second one using "file OUT mod".

Modified code:

             filename OUT pipe
                'compress > NOTSOBIG.Z'
                     lrecl=32760 ;

             data _null_ ;
               file OUT ;

               if 0 then
                 set ONEDATA nobs=CONSTNOBS1 ;
               do POINTVAR = 1 to CONSTNOBS1 ;
                 set ONEDATA point=POINTVAR ;
                 put <field>…<field> ;
               end ;

               if 0 then
                 set TWODATA nobs=CONSTNOBS2 ;
               do POINTVAR = 1 to CONSTNOBS2 ;
                 set TWODATA point=POINTVAR ;
                 put <field>…<field> ;
               end ;

               stop ; /*<=REMEMBER!*/
             run ;

Works fine, looks a little cluttered.




The "set" statements that assign the values of CONSTNOBS1 and CONSTNOBS2
have their effect at compile time, not at execution time.
So they never have to be executed; they only have to be present when the
data step is compiled, and they can appear anywhere within the data step.


As an alternative to prefixing each "set ... nobs=" statement with "if 0" or "if 1 = 2"
etc.,
we could move them to after the "stop" statement.
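
The compile-time behaviour of "nobs=" can be seen in isolation with a small experiment. This is a sketch for illustration only: the data set EMPTYDATA and the variables X and CONSTNOBS are names made up here. The "set" statement is placed after "stop", so it is never executed, yet its "nobs=" variable already holds a value when the "put" statement runs.

```sas
/* Create a data set with one variable and zero observations. */
data EMPTYDATA ;
  length X 8 ;
  stop ;                 /* stop before the implicit output   */
run ;

data _null_ ;
  /* CONSTNOBS was assigned when the step was compiled,       */
  /* so it can be used before the set statement is reached.   */
  put 'Observations in EMPTYDATA: ' CONSTNOBS ;
  stop ;
  set EMPTYDATA nobs=CONSTNOBS ;   /* never executed          */
run ;
```

The log should show a count of 0 for EMPTYDATA, and the step ends normally because the "set" statement that would hit end-of-file is never reached.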
Original code :: Modified code
Original code: the two data steps from the earlier "Original code" slide, the second one using "file OUT mod".

Modified code:

             filename OUT pipe
                'compress > NOTSOBIG.Z'
                     lrecl=32760 ;

             data _null_ ;
               file OUT ;

               do POINTVAR = 1 to CONSTNOBS1 ;
                 set ONEDATA point=POINTVAR
                             nobs=CONSTNOBS1 ;
                 put <field>…<field> ;
               end ;

               do POINTVAR = 1 to CONSTNOBS2 ;
                 set TWODATA point=POINTVAR
                             nobs=CONSTNOBS2 ;
                 put <field>…<field> ;
               end ;

               stop ; /*<=REMEMBER!*/
             run ;




The compile-time "nobs=" and the execution-time "point=" options of the "set"
statement can both be present on a single "set" statement.


And if any data set, including one before the last, has zero observations,
its "nobs=" variable is set to zero at compile time,
so that "do" loop reduces to "do POINTVAR = 1 to 0 ;".
Execution does not enter the loop, and the "set" statement inside it
is never executed at run time.
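
To try the combined "point=" and "nobs=" form end to end, the pattern can be run against small test data. This is a sketch under stated assumptions: the test data sets and the variable ID are made up for the purpose (the real job writes its own fields), ONEDATA is deliberately left empty, and the "compress" program is assumed to be installed.

```sas
/* Test data: ONEDATA has zero observations, TWODATA has three. */
data ONEDATA ;
  length ID 8 ;
  stop ;
run ;

data TWODATA ;
  do ID = 1 to 3 ;
    output ;
  end ;
run ;

filename OUT pipe
  'compress > NOTSOBIG.Z'
      lrecl=32760 ;

data _null_ ;
  file OUT ;

  /* CONSTNOBS1 is 0, so this loop body is never entered. */
  do POINTVAR = 1 to CONSTNOBS1 ;
    set ONEDATA point=POINTVAR nobs=CONSTNOBS1 ;
    put ID ;
  end ;

  do POINTVAR = 1 to CONSTNOBS2 ;
    set TWODATA point=POINTVAR nobs=CONSTNOBS2 ;
    put ID ;
  end ;

  stop ;   /* required, because point= never signals end-of-file */
run ;
```

The three TWODATA records should reach NOTSOBIG.Z even though ONEDATA is empty. To test the data step logic without involving "compress", "file OUT ;" can be replaced with "file log ;" so that the records are written to the SAS log instead.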
DATA Step -- compilation phase v. execution phase
     http://v8doc.sas.com
     SAS OnlineDoc
     . Base SAS Software
     . . SAS Language Reference: Concepts
     . . . DATA Step Concepts
     . . . . DATA Step Processing
     . . . . . Overview of DATA Step Processing



     Flow of Action

     When you submit a DATA step for execution, it is first compiled and then
       executed.



     .
SET
     http://v8doc.sas.com
     SAS OnlineDoc
     . Base SAS Software
     . . SAS Language Reference: Dictionary
     . . . Dictionary of Language Elements
     . . . . Statements
     . . . . . SET
     NOBS=variable
     At compilation time, SAS reads the descriptor portion of each data set and
         assigns the value of the NOBS= variable automatically. Thus, you can refer
         to the NOBS= variable before the SET statement.
     POINT=variable
     POINT= causes the SET statement to use random (direct) access to read a
         SAS data set.
     CAUTION:
         Continuous loops can occur when you use the POINT= option.
         When you use the POINT= option, you must include a STOP statement to stop
           DATA step processing, programming logic that checks for an invalid value of the
           POINT= variable, or both.
Reading from and Writing to UNIX Commands (PIPE)
     http://v8doc.sas.com
     SAS OnlineDoc
     . Base SAS Software
     . . Host Specific Information
     . . . UNIX Environments (Companion)
     . . . . Running the SAS System Under UNIX
     . . . . . Using External Files and Devices
     . . . . . . Reading from and Writing to UNIX Commands (PIPE)

     FILENAME fileref PIPE 'UNIX-command' <options>;
     Under UNIX, you can use the FILENAME statement to assign filerefs not only
        to external files and I/O devices, but also to a pipe. Pipes enable your SAS
        application to receive input from any UNIX command that writes to standard
        output and to route output to any UNIX command that reads from standard
        input.
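
As an illustration of the input direction under UNIX, a pipe can report free disk space before a large extract is attempted. This is a sketch only, not part of the original job: the fileref DISKFREE is a name chosen here, and the layout of the "df -k" output varies between UNIX systems, so the lines are simply echoed to the SAS log rather than parsed.

```sas
/* Hypothetical example: run "df -k ." and copy its output */
/* (free space for the current directory) to the SAS log.  */
filename DISKFREE pipe 'df -k .' ;

data _null_ ;
  infile DISKFREE ;
  input ;            /* read one line of command output */
  put _infile_ ;     /* echo it to the log              */
run ;
```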
     .



Using Unnamed Pipes (MS Windows)
            http://v8doc.sas.com
            SAS OnlineDoc
            . Base SAS Software
            . . Host Specific Information
            . . . Microsoft Windows Environment (Companion)
            . . . . Using SAS with Other Windows Applications
            . . . . . Using Unnamed and Named Pipes
            . . . . . . Using Unnamed Pipes

            FILENAME fileref PIPE 'program-name' option-list

      Example:
       filename DIR pipe 'dir /?' ;
       data _null_ ;
        infile DIR ;
        input ;
        put _infile_ ;
       run ;

      Log:
       NOTE: The infile DIR is:
             Unnamed Pipe Access Device,
             PROCESS=dir /?,RECFM=V,LRECL=256
       Displays a list of files and subdirectories in a directory.
       ...
       NOTE: 38 records were read from the infile DIR.
             The minimum record length was 0.
             The maximum record length was 75.
       NOTE: DATA statement used (Total process time):
             real time           0.09 seconds
             cpu time            0.03 seconds




Unfortunately, I couldn't find any compression program under MS Windows which
would read a file to be compressed from its standard input.
Questions?

     SAS and all other SAS Institute Inc. product or service names are
     registered trademarks or trademarks of SAS Institute Inc. in the USA
     and other countries. ® indicates USA registration.

     Other brand and product names are registered trademarks or
     trademarks of their respective companies.


Compressing Output Files in a SAS® job, on a UNIX platform

  • 1. Compressing Output Files in a SAS® job, on a UNIX platform Leslie J. Somos Great Works Informatics, LLC 1 © 2008 by Great Works Informatics, LLC
  • 2. Original situation UNIX MS Windows FTP email SAS file file 2 © 2008 by Great Works Informatics, LLC Done for multiple clients. No problem, until file sometimes is bigger for clients with more data. As file gets bigger, FTP from UNIX to MS Windows takes longer, and email administrators get unhappy.
  • 3. Original situation UNIX MS Windows SAS big FTP big file file zip email zipped file 3 © 2008 by Great Works Informatics, LLC We (manually) zip the file to not annoy the email administrators. Everything is fine until we hit a wall, with one particular request for a multi-year extract, and run out of UNIX space to create the file. (Two million records, fixed length, record length 2000 bytes, the resulting file would have been ~4GBytes. Would have been -- if there had been enough UNIX space.) -- In each case, after the file is created, we don’t do anything further with it on UNIX, so it would not be a problem if we could somehow get SAS to write out an already- zipped file.
  • 4. Goal UNIX MS Windows SAS FTP email zipped file zipped file 4 © 2008 by Great Works Informatics, LLC So this picture shows what we want to happen -- have SAS write an already- compressed file, which takes up less disk space and also takes less time to FTP from UNIX over to MS Windows. (4GBytes produces a zipped file ~74Mbytes. FTP from UNIX to MS Windows ~22 minutes.)
  • 5. Items we will touch on • 'pipe' access method of FILENAME statement • 'nobs=' and 'point=' options of SET statement • compilation phase v. execution phase of DATA Step 5 © 2008 by Great Works Informatics, LLC
  • 6. Coding Conventions • SAS keywords in lowercase • User-chosen words in uppercase • Spaces supplied even where not required by the syntax • Optional period after a macro variable always supplied 6 © 2008 by Great Works Informatics, LLC
  • 7. Original code filename OUT 'BIG.TXT' lrecl=32760 ; ordinary file data _null_ ; set OURDATA ; file OUT ; put <field>…<field> ; run ; 7 © 2008 by Great Works Informatics, LLC Modify the "filename" statement to use an 'unnamed pipe' to write the data to the "compress" program, which then writes a compressed version to disk. The full-size file only ever exists on-the-fly, as it flows through the OUT fileref. Only the compressed version of the file ever exists on disk, so peak disk space usage is less.
  • 8. Modified code
    Fileref OUT is an "unnamed pipe":
        filename OUT pipe 'compress > NOTSOBIG.Z' lrecl=32760 ;
        data _null_ ;
          set OURDATA ;
          file OUT ;
          put <field>…<field> ;
        run ;
    The compress program reads its STDIN, where it receives the data written to fileref OUT, and writes to its STDOUT, which is redirected to NOTSOBIG.Z.
    Modify the "filename" statement to use an 'unnamed pipe' to write the data to the "compress" program, which then writes a compressed version to disk. The full-size file only ever exists on the fly, as it flows through the OUT fileref. Only the compressed version of the file ever exists on disk, so peak disk space usage is lower.
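What the pipe fileref does can be seen at the shell level. The sketch below is an illustration only, not part of the original job: a `printf` loop stands in for the SAS data step, `gzip` stands in for `compress` (an assumption, since `compress` is not installed on every system), and the file names and record count are made up.

```shell
#!/bin/sh
# Hypothetical stand-in for the SAS job: a writer streams records into a
# compressor's STDIN, and the compressor's STDOUT is redirected to disk.
# gzip is used in place of compress (assumption: gzip is installed).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in for the data step: 1000 fixed-length records of 80 bytes each
# (79 padded characters plus a newline).
i=1
while [ "$i" -le 1000 ]; do
    printf '%-79s\n' "record $i"
    i=$((i + 1))
done | gzip > NOTSOBIG.gz

# Only the compressed file exists on disk; the full-size text never did.
ls "$workdir"
raw_bytes=$(gzip -dc NOTSOBIG.gz | wc -c | tr -d ' ')
zip_bytes=$(wc -c < NOTSOBIG.gz | tr -d ' ')
echo "uncompressed: $raw_bytes bytes, on disk: $zip_bytes bytes"
```

The writer never names an output file; the redirection on the compressor's side decides where the bytes land, which is the same division of labor as in the 'filename OUT pipe' statement.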
  • 9. Original code :: Modified code
    Original "filename" statement (ordinary file):
        filename OUT 'BIG.TXT' lrecl=32760 ;
    Modified "filename" statement ("unnamed pipe"):
        filename OUT pipe 'compress > NOTSOBIG.Z' lrecl=32760 ;
    Data step, identical in both versions:
        data _null_ ;
          set OURDATA ;
          file OUT ;
          put <field>…<field> ;
        run ;
    No changes to the data step, only to the filename statement. When you uncompress, supply the name BIG.TXT.
    A simple modification: no changes were necessary to the data step, only to the filename statement. Since the "compress" command received only a byte stream, it could not and did not record any file name within the compressed file. When the file is subsequently uncompressed, the program will prompt for a file name to be supplied.
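Because the compressor saw only a byte stream, the name of the restored file has to come from whoever uncompresses it. A minimal shell sketch of that step, with `gzip` as an assumed stand-in for `compress` and made-up file contents: decompress to standard output and redirect to the name you want.

```shell
#!/bin/sh
# gzip stands in for compress here (assumption: gzip is installed).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Compress from STDIN: the compressor never learns the name BIG.TXT.
printf 'line one\nline two\n' | gzip > NOTSOBIG.gz

# Decompress to STDOUT and supply the output name explicitly.
gzip -dc NOTSOBIG.gz > BIG.TXT
cat BIG.TXT
```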
  • 10. Original code
        filename OUT 'BIG.TXT' lrecl=32760 ;
        data _null_ ;
          set ONEDATA ;
          file OUT ;
          put <field>…<field> ;
        run ;
        data _null_ ;
          set TWODATA ;
          file OUT mod ;
          put <field>…<field> ;
        run ;
    Slightly more complex code, closer to the actual original code: multiple data steps write to the same output file. (The original code had six data steps; two are sufficient to demonstrate the issues we will discuss.) Each data step after the first one appends to the output file by using the "mod" option of the "file" statement.
  • 11. Original code :: Modified code
    Original "filename" statement:
        filename OUT 'BIG.TXT' lrecl=32760 ;
    Modified "filename" statement:
        filename OUT pipe 'compress > NOTSOBIG.Z' lrecl=32760 ;
    Data steps, identical in both versions:
        data _null_ ;
          set ONEDATA ;
          file OUT ;
          put <field>…<field> ;
        run ;
        data _null_ ;
          set TWODATA ;
          file OUT mod ;
          put <field>…<field> ;
        run ;
    No changes to the data steps: problem! The compress program gets called a second time and overwrites NOTSOBIG.Z. Only TWODATA is in the resulting file.
    If we follow the pattern we know, we modify just the "filename" statement. Each data step starts the "compress" program anew and overwrites the previous content of the output file NOTSOBIG.Z. The "mod" option of the "file" statement has no effect here, where it refers to an unnamed pipe.
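The overwrite can be reproduced outside SAS. In this hedged sketch, two separate pipelines play the role of the two data steps, `gzip` stands in for `compress` (an assumption), and the record text is invented; each pipeline starts its own compressor, and the `>` redirection truncates the file each time.

```shell
#!/bin/sh
# gzip stands in for compress here (assumption: gzip is installed).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# First "data step": starts a compressor, which creates NOTSOBIG.gz.
printf 'from ONEDATA\n' | gzip > NOTSOBIG.gz
# Second "data step": starts a new compressor; '>' truncates the file.
printf 'from TWODATA\n' | gzip > NOTSOBIG.gz

# Only the second writer's data survives.
survivors=$(gzip -dc NOTSOBIG.gz)
echo "$survivors"
```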
  • 12. Original code :: Modified code
    Original:
        filename OUT 'BIG.TXT' lrecl=32760 ;
        data _null_ ;
          set ONEDATA ;
          file OUT ;
          put <field>…<field> ;
        run ;
        data _null_ ;
          set TWODATA ;
          file OUT mod ;
          put <field>…<field> ;
        run ;
    Modified:
        filename OUT pipe 'compress > NOTSOBIG.Z' lrecl=32760 ;
        data _null_ ;
          file OUT ;
          set ONEDATA nobs=CONSTNOBS1 ;
          do POINTVAR = 1 to CONSTNOBS1 ;
            set ONEDATA point=POINTVAR ;
            put <field>…<field> ;
          end ;
          set TWODATA nobs=CONSTNOBS2 ;
          do POINTVAR = 1 to CONSTNOBS2 ;
            set TWODATA point=POINTVAR ;
            put <field>…<field> ;
          end ;
          stop ; /*<=REMEMBER!*/
        run ;
    Only one data step, one call to compress. Problem if ONEDATA has zero observations.
    Solution attempt: combine the multiple data steps into one single data step, so there is only one invocation of the "compress" program. [When you take over from the normal SAS data cycle, you also have to include logic to stop execution.]
    Note: the single variable "POINTVAR" is used in multiple loops with no problem. The two "nobs=" variables in the two separate "set" statements must be different from each other, else one interferes with the other at compile time.
    Works in most cases, but: if any data set before the last has zero observations, execution stops when its "set" statement is executed, and the output file is incomplete.
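The "one invocation of compress" idea has a direct shell analogue, sketched here under the same assumptions as before (`gzip` in place of `compress`, invented record text): group the writers so that they share a single pipe into a single compressor.

```shell
#!/bin/sh
# gzip stands in for compress here (assumption: gzip is installed).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Both writers share one pipe, so the compressor is started only once
# and the output file is opened (and truncated) only once.
{
    printf 'from ONEDATA\n'
    printf 'from TWODATA\n'
} | gzip > NOTSOBIG.gz

gzip -dc NOTSOBIG.gz > combined.txt
cat combined.txt
```

This only models the single-compressor part of the slide; the SAS-specific behavior of "set" on an empty data set is not reproduced here.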
  • 13. Original code :: Modified code
    Original:
        filename OUT 'BIG.TXT' lrecl=32760 ;
        data _null_ ;
          set ONEDATA ;
          file OUT ;
          put <field>…<field> ;
        run ;
        data _null_ ;
          set TWODATA ;
          file OUT mod ;
          put <field>…<field> ;
        run ;
    Modified:
        filename OUT pipe 'compress > NOTSOBIG.Z' lrecl=32760 ;
        data _null_ ;
          file OUT ;
          if 0 then set ONEDATA nobs=CONSTNOBS1 ;
          do POINTVAR = 1 to CONSTNOBS1 ;
            set ONEDATA point=POINTVAR ;
            put <field>…<field> ;
          end ;
          if 0 then set TWODATA nobs=CONSTNOBS2 ;
          do POINTVAR = 1 to CONSTNOBS2 ;
            set TWODATA point=POINTVAR ;
            put <field>…<field> ;
          end ;
          stop ; /*<=REMEMBER!*/
        run ;
    Works fine, looks a little cluttered.
    The "set" statements that assign the values of CONSTNOBS1 and CONSTNOBS2 have their effect at compile time, not at execution time. So they never have to be executed; they simply have to be present at compile time, and they could be anywhere within the data step. As an alternative to prefixing each "set ... nobs=" statement with "if 0" or "if 1 = 2" etc., we could move them to after the "stop" statement.
  • 14. Original code :: Modified code
    Original:
        filename OUT 'BIG.TXT' lrecl=32760 ;
        data _null_ ;
          set ONEDATA ;
          file OUT ;
          put <field>…<field> ;
        run ;
        data _null_ ;
          set TWODATA ;
          file OUT mod ;
          put <field>…<field> ;
        run ;
    Modified:
        filename OUT pipe 'compress > NOTSOBIG.Z' lrecl=32760 ;
        data _null_ ;
          file OUT ;
          do POINTVAR = 1 to CONSTNOBS1 ;
            set ONEDATA point=POINTVAR nobs=CONSTNOBS1 ;
            put <field>…<field> ;
          end ;
          do POINTVAR = 1 to CONSTNOBS2 ;
            set TWODATA point=POINTVAR nobs=CONSTNOBS2 ;
            put <field>…<field> ;
          end ;
          stop ; /*<=REMEMBER!*/
        run ;
    The compile-time "nobs=" and the execution-time "point=" options of the "set" statement can both be present on a single "set" statement.
    And, if any data set before the last has zero observations, its "set" statement is not actually executed at run time: its "nobs=" variable is set to zero at compile time, so that "do" loop reduces to "do POINTVAR = 1 to 0 ;" and execution does not enter it.
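The effect of a loop bound of zero can be shown with a loose shell analogy. This is not a model of SAS compile-time behavior: the hypothetical helper `write_records` simply takes the record count as an argument, `gzip` again stands in for `compress` (an assumption), and the labels are invented. With a count of 0 the loop body never runs, and the next writer is still reached.

```shell
#!/bin/sh
# gzip stands in for compress here (assumption: gzip is installed).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical helper: write n numbered records with the given label.
# With n = 0 the loop condition fails at once and nothing is written.
write_records() {
    label=$1
    n=$2
    i=1
    while [ "$i" -le "$n" ]; do
        printf '%s record %s\n' "$label" "$i"
        i=$((i + 1))
    done
}

{
    write_records ONEDATA 0   # bound 0: loop body never runs
    write_records TWODATA 2   # still reached, still written
} | gzip > NOTSOBIG.gz

gzip -dc NOTSOBIG.gz > result.txt
cat result.txt
```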
  • 15. DATA Step -- compilation phase v. execution phase
    http://v8doc.sas.com
    SAS OnlineDoc > Base SAS Software > SAS Language Reference: Concepts > DATA Step Concepts > DATA Step Processing > Overview of DATA Step Processing
    Flow of Action: "When you submit a DATA step for execution, it is first compiled and then executed."
  • 16. SET
    http://v8doc.sas.com
    SAS OnlineDoc > Base SAS Software > SAS Language Reference: Dictionary > Dictionary of Language Elements > Statements > SET
    NOBS=variable: "At compilation time, SAS reads the descriptor portion of each data set and assigns the value of the NOBS= variable automatically. Thus, you can refer to the NOBS= variable before the SET statement."
    POINT=variable: "POINT= causes the SET statement to use random (direct) access to read a SAS data set."
    CAUTION: "Continuous loops can occur when you use the POINT= option. When you use the POINT= option, you must include a STOP statement to stop DATA step processing, programming logic that checks for an invalid value of the POINT= variable, or both."
  • 17. Reading from and Writing to UNIX Commands (PIPE)
    http://v8doc.sas.com
    SAS OnlineDoc > Base SAS Software > Host Specific Information > UNIX Environments (Companion) > Running the SAS System Under UNIX > Using External Files and Devices > Reading from and Writing to UNIX Commands (PIPE)
        FILENAME fileref PIPE 'UNIX-command' <options>;
    "Under UNIX, you can use the FILENAME statement to assign filerefs not only to external files and I/O devices, but also to a pipe. Pipes enable your SAS application to receive input from any UNIX command that writes to standard output and to route output to any UNIX command that reads from standard input."
  • 18. Using Unnamed Pipes (MS Windows)
    http://v8doc.sas.com
    SAS OnlineDoc > Base SAS Software > Host Specific Information > Microsoft Windows Environment (Companion) > Using SAS with Other Windows Applications > Using Unnamed and Named Pipes > Using Unnamed Pipes
        FILENAME fileref PIPE 'program-name' option-list
    Example:
        filename DIR pipe 'dir /?' ;
        data _null_ ;
          infile DIR ;
          input ;
          put _infile_ ;
        run ;
    Log:
        NOTE: The infile DIR is:
              Unnamed Pipe Access Device,
              PROCESS=dir /?,RECFM=V,LRECL=256
        Displays a list of files and subdirectories in a directory.
        ...
        NOTE: 38 records were read from the infile DIR.
              The minimum record length was 0.
              The maximum record length was 75.
        NOTE: DATA statement used (Total process time):
              real time 0.09 seconds
              cpu time 0.03 seconds
    Unfortunately, I couldn't find any compression program under MS Windows which would read a file to be compressed from its standard input.
  • 19. Questions?
    SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.