3. 3
Introductions
• My Background
• PROC IMPORT Basics
• CSV Specification and Issues
• Fixing CSV Data Outside of SAS
• Tips and Tricks
4. Abstract
• PROC IMPORT comes in handy when quickly trying to load a CSV or similar file. But
it does have limitations. Unfortunately, I've run into those limitations and had to
work around them.
• This session will discuss the original CSV specification (early 1980's), how Microsoft
Excel violates that specification, how SAS PROC IMPORT does not follow that
specification, and the issues that can result.
• Simple UNIX tools will be described that can be used to ensure that data hilarities
do not occur due to CSV issues.
• Recommendations will be made to get around some of PROC IMPORT limitations
(like field naming, data type determination, limitation in number of fields,
separator in data).
• CSV, TAB, and DLM types will be discussed.
4
5. 5
My Background
• Base SAS on Mainframe, UNIX, and PC Platforms
• SAS is primarily an ETL tool or Programming Language for me
• My background is IT – I am not a modeler
• Not my first User Group presentation – presented sessions and
seminars in Australia, France, the US, Oxford England (about the
British Author Nevil Shute), and Canada.
• Undergraduate: Computer and Information Sciences, Temple Univ.
• Graduate: Organizational Dynamics, UPenn
• Most of my career was in consulting
• Have written several books (none SAS-related, yet)
• Adjunct Instructor covering IT topics.
6. 6
PROC IMPORT Basics
• Will focus on code rather than GUI Import Wizard
• Handy tool to load data from other sources
• You provide the input (flat) file, SAS creates the dataset
• The primary formats:
– CSV – Comma Separated Variables
– TAB – Tab (ASCII ‘09x’ EBCDIC ’05’x)
– DLM – You pick it
• Quoted strings get special treatment any time you use delimited
files
– PROC IMPORT
– INFILE with DLM and DSD (“Delimiter-Sensitive Data”)
7. 7
PROC IMPORT Basics
• Syntax:
PROC IMPORT
DATAFILE="filename”
OUT=<libref.>SAS data-set <(SAS data-set-options)>
<DBMS=identifier>
<REPLACE> ;
<data-source-statement(s);>
• DBMS identifies data source type
– CSV, TAB, and DLM will be discussed here
– DBMS options could be an entire talk in themselves and largely
depends on operating system and installed components.
• REPLACE allows output dataset to be overwritten.
8. 8
PROC IMPORT Basics
• Data-source-statements:
– DELIMITER only specified with DBMS=DLM
• Will override DBMS=CSV
• Can contain multiple values DELIMTER=‘!|’
– GETNAMES gets variable names from file
• GETNAMES=YES is the default
• Will “sasify” the names (underlines replace spaces, length, etc.)
• Defaults DATAROW to 2
– DATAROW=n
• Is similar in function to FIRSTOBS
• Defaults to 1 unless GETNAMES=YES
– GUESSINGROWS=20 is the default
• Used by wizard to build format and input statements
• Tradeoff between time and accuracy – higher count gives greater accuracy
9. 9
Example from SAS Documentation
• The input CSV:
"Africa","Boot","Addis Ababa","12","$29,761","$191,821","$769"
"Asia","Boot","Bangkok","1","$1,996","$9,576","$80"
"Canada","Boot","Calgary","8","$17,720","$63,280","$472"
"Central America/Caribbean","Boot","Kingston","33","$102,372","$393,376","$4,454"
"Eastern Europe","Boot","Budapest","22","$74,102","$317,515","$3,341"
"Middle East","Boot","Al-Khobar","10","$15,062","$44,658","$765"
"Pacific","Boot","Auckland","12","$20,141","$97,919","$962"
"South America","Boot","Bogota","19","$15,312","$35,805","$1,229"
"United States","Boot","Chicago","16","$82,483","$305,061","$3,735"
"Western Europe","Boot","Copenhagen","2","$1,663","$4,657","$129“
• The Code:
proc import datafile=“csv1.csv"
out=shoes dbms=csv replace;
getnames=no;
run;
10. 10
Example from SAS Documentation
• Generated Code:
7 /**********************************************************************
8 * PRODUCT: SAS
9 * VERSION: 9.2
10 * CREATOR: External File Interface
11 * DATE: 10JUN14
12 * DESC: Generated SAS Datastep Code
13 * TEMPLATE SOURCE: (None Specified.)
14 ***********************************************************************/
15 data WORK.SHOES ;
16 %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
17 infile 'csv1.csv' delimiter = ',' MISSOVER DSD lrecl=32767 ;
18 informat VAR1 $34. ;
...
25 format VAR1 $34. ;
...
32 input
33 VAR1 $
...
40 ;
41 if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection
macro variable */
42 run;
11. 11
Example from SAS Documentation
• Output:
Obs VAR1 VAR2 VAR3
VAR4 VAR5 VAR6 VAR7
1 Africa Boot Addis
Ababa 12 $29,761 $191,821 $769
2 Asia Boot Bangkok
1 $1,996 $9,576 $80
3 Canada Boot Calgary
8 $17,720 $63,280 $472
4 Central America/Caribbean Boot Kingston
33 $102,372 $393,376 $4,454
5 Eastern Europe Boot Budapest
22 $74,102 $317,515 $3,341
...
10 Western Europe Boot Copenhagen
2 $1,663 $4,657 $129
12. 12
CSV Specification and Issues
• When CSV first came out, there was no real specification
• History traced to list directed/free form input specification in
FORTRAN 77
• First mention I could find (in SAS’ TS-673) was version 6
– Did not handle empty fields (two delimiters in a row)
– Did not handle quoted strings containing delimiter
• Each vendor had their own “standard” which may or may not
support quoted strings
• For a long time, Excel did not quote strings
• The creator of your file may not quote strings
• RFC 4180, IETF 2005, is the definitive reference now!
13. 13
CSV Specification and Issues
• The format is basically:
– Optional Header (usually column names):
field_name,field_name,field_name
– Followed by Datalines:
aaa,bbb,ccc
zzz,yyy,xxx
– Optionally, strings are contained within single or double quotes
"aaa","bbb","ccc"
– If a quoted string contains the quote character, it must be escaped:
• "Af"r,ica" Becomes “Af”r and ica” fields
• "Af”"r,ica" Becomes “Af”r,ica” (a single field)
– Each line should have a line terminator
• The specification really only matters to you if you have problems
importing a particular file.
14. 14
CSV Specification and Issues
• What happens if delimiter-containing fields are not quoted?
– Depends on consistency of data
– OK:
Applicant, Instructor
Resources, Human
Management, Fiscal
– Not OK:
Applicant, Instructor
Resources, Human
Training
• Inconsistent number of fields
• Column names exist in some files (but not all files)
• Page titles
15. 15
Fixing CSV Data Outside of SAS
• I strongly advise against CSV in favor of TAB
delimited
• Second choice is delimited by other character
(like ‘|’ pipe)
• Ask for a new version properly formatted
• Manually edit the file
• Use some UNIX/Linux and SAS tricks
16. 16
Tips and Tricks – Modifying Generated Code
• My favorite trick is to let PROC IMPORT write the initial code for a file and
then copy/paste/edit the resulting DATA step into the program.
• Can alter variable names, formats, and data types
• Can use more INFILE options like DLMSTR, DLMSOPT, END, others
• Can perform data validation/correction
– Excel will let you enter ‘14-JUL’ as a date, treat it as July 14 2014, but
write it out to the file as ‘14-JUL’
– Humans are lazy.
– Your program could input the field and correct in on flow through the
data.
• Programmers are lazy. I don’t want to have to type all the variables in a
large delimited file.
• UNIX has a tool to cut columns out of a file:
– cut -c 10- csv3.log > data.txt
– Will eliminate row numbers and formatting in log columns 1-9
17. 17
Tips and Tricks – Hanging PROC
• If you see this error when using PROC IMPORT or EXPORT while
running from the command line:
ERROR: Cannot open X display.
Check the name/server access authorization.
• Then the wizard is trying to run interactively. Your command line
should include:
sas –noterminal
• Supposed to be fixed in v9.3
• Another symptom is that the PROC “hangs”
18. 18
Tips and Tricks – Delimited File Not Quote Escaped
• Problem reported recently by Don Stanley
bchr0001@GMAIL.COM
• This is the case where quotes are just data:
AAAA.BBBB.'I.CCCC'.0.DL
• PROC IMPORT/INFILE DLM DSD wants to treat 'I.CCCC‘ as
one field, developer wants it to be two.
• With the possibility of empty fields, partial DSD functionality
is needed. But DSD also invokes the quoted field handling.
19. 19
Tips and Tricks – Delimited File Not Quote Escaped
• SAS Solution (I lost the contributor name)
infile custdata dlm=‘,' ;
input @ ;
do while(index(_infile_,‘,,')>0) ;
_infile_ = tranwrd(_infile_,‘,,',‘, ,') ;
end ; /* comma-comma to comma-space-comma */
input << delimited fields >> ;
– _infile_ is limited to 32,767 – could be problem for large records
• UNIX Shell Solution
awk '{gsub(“,,", “, ,");print $0;}' INPUTFILE >
OUTPUTFILE
– Handles (essentially) unlimited record lengths
20. 20
Tips and Tricks – File Contents
• UNIX head will show the first few rows
– Does the file have extra lines at the top (like a title line)?
• UNIX tail will show the last few rows
– Do I have all the data?
– Are the records consistent beginning to end?
• UNIX wc -l (word count, line) will show how many lines in total
– Is this file the right size?
• You can always edit the file (unless too large)
• UNIX od -c -x (object dump, character and hex) is useful to see
format and contents of file (including unprintable characters).
21. 21
Tips and Tricks – File Contents
• Determine maximum record length and field count (based on delimiters)
with a script
• This example is written in awk (new awk):
BEGIN{ FS=","; mNF=0; mRL=0; }
{
if (NF > mNF) mNF=NF;
RL=length($0);
if (RL > mRL) mRL=RL;
}
END{ print "max NF " mNF " max RECL " mRL;}
• Max NF is highest number of fields on any record (does not handle quote
escaping)
• Max RECL is the longest record length in the file
• Can be saved in a file and executed on your data:
awk –f csv_check.awk INPUTFILE.csv
22. 22
Tips and Tricks – Dropping Titles
• The input CSV:
This is a title line and should be here
country,shoe,city,count,average salary,max salary,average
"Africa","Boot","Addis Ababa","12","$29,761","$191,821","$769"
"Asia","Boot","Bangkok","1","$1,996","$9,576","$80"
"Canada","Boot","Calgary","8","$17,720","$63,280","$472“
• PROC IMPORT results:
Number of names found is less than number of variables found.
Name This is a title line and should be here truncated to
This_is_a_title_line_and_should.
Problems were detected with provided names. See LOG.
• How to get rid of the first line since DATAROW= won’t help:
– wc -l INPUT – returns value 5 (5 lines)
– tail -4 INPUT > OUTPUT – saves last 4 lines
23. 23
Tips and Tricks – Dropping Junk Records at End
• The input CSV:
country,shoe,city,count,average salary,max salary,average
"Africa","Boot","Addis Ababa","12","$29,761","$191,821","$769"
"Asia","Boot","Bangkok","1","$1,996","$9,576","$80"
"Canada","Boot","Calgary","8","$17,720","$63,280","$472"
,,,,,,
• Results in:
average_
Obs country shoe city count salary max_salary
average
1 Africa Boot Addis Ababa 12 $29,761 $191,821 $769
2 Asia Boot Bangkok 1 $1,996 $9,576 $80
3 Canada Boot Calgary 8 $17,720 $63,280 $472
4
• How to get rid of the last line since DATAROW= won’t help:
– wc -l INPUT – returns value 5 (5 lines)
– head -4 INPUT > OUTPUT – saves first 4 lines
24. 24
Personal Note
• On a Personal Note
– I seem to learn a bunch when working on presentations,
new classes, and writings
– I was still treating PROC IMPORT like I was still using V6, I
didn’t realize it would handle quoted strings properly now.
– I have to learn some new habits and trust the tool more.
– In any commands within this presentation, the single and
double quotation marks should be simple, not the “smart
quotes” forced my Microsoft. The same applies to dashes
or minus signs – they should not be “em dashes” (- versus
–)