BB-180-2017
PROC IMPORT and more
Or: When PROC IMPORT just doesn't do the job
David B. Horvath, CCP, MS
PROC IMPORT and more
The Author can be contacted at:
504 Longbotham Drive, Aston PA 19014-2502, USA
Phone: 1-610-859-8826
Email: dhorvath@cobs.com
Web: http://www.cobs.com/
All trademarks and servicemarks are the
property of their respective owners.
Copyright © 2017, David B. Horvath, CCP — All Rights Reserved
Introductions
• My Background
• PROC IMPORT Basics
• CSV Specification and Issues
• Fixing CSV Data Outside of SAS
• Tips and Tricks
Abstract
• PROC IMPORT comes in handy when quickly trying to load a CSV or similar file. But
it does have limitations. Unfortunately, I've run into those limitations and had to
work around them.
• This session will discuss the original CSV specification (early 1980's), how Microsoft
Excel violates that specification, how SAS PROC IMPORT does not follow that
specification, and the issues that can result.
• Simple UNIX tools will be described that can be used to ensure that data hilarities
do not occur due to CSV issues.
• Recommendations will be made to get around some of PROC IMPORT's limitations
(like field naming, data type determination, limits on the number of fields, and
separators in the data).
• CSV, TAB, and DLM types will be discussed.
My Background
• Base SAS on Mainframe, UNIX, and PC Platforms
• SAS is primarily an ETL tool or Programming Language for me
• My background is IT – I am not a modeler
• Not my first User Group presentation – presented sessions and
seminars in Australia, France, the US, Oxford England (about the
British Author Nevil Shute), and Canada.
• Undergraduate: Computer and Information Sciences, Temple Univ.
• Graduate: Organizational Dynamics, UPenn
• Most of my career was in consulting
• Have written several books (none SAS-related, yet)
• Adjunct Instructor covering IT topics.
PROC IMPORT Basics
• Will focus on code rather than GUI Import Wizard
• Handy tool to load data from other sources
• You provide the input (flat) file, SAS creates the dataset
• The primary formats:
– CSV – Comma-Separated Values
– TAB – Tab (ASCII '09'x, EBCDIC '05'x)
– DLM – You pick it
• Quoted strings get special treatment any time you use delimited
files
– PROC IMPORT
– INFILE with DLM and DSD (“Delimiter-Sensitive Data”)
PROC IMPORT Basics
• Syntax:
PROC IMPORT
DATAFILE="filename"
OUT=<libref.>SAS data-set <(SAS data-set-options)>
<DBMS=identifier>
<REPLACE> ;
<data-source-statement(s);>
• DBMS identifies data source type
– CSV, TAB, and DLM will be discussed here
– DBMS options could be an entire talk in themselves and largely
depend on operating system and installed components.
• REPLACE allows output dataset to be overwritten.
PROC IMPORT Basics
• Data-source-statements:
– DELIMITER only specified with DBMS=DLM
• Will override DBMS=CSV
• Can contain multiple values: DELIMITER='!|'
– GETNAMES gets variable names from file
• GETNAMES=YES is the default
• Will “sasify” the names (underlines replace spaces, length, etc.)
• Defaults DATAROW to 2
– DATAROW=n
• Is similar in function to FIRSTOBS
• Defaults to 1 unless GETNAMES=YES
– GUESSINGROWS=20 is the default
• Used by wizard to build format and input statements
• Tradeoff between time and accuracy – higher count gives greater accuracy
Example from SAS Documentation
• The input CSV:
"Africa","Boot","Addis Ababa","12","$29,761","$191,821","$769"
"Asia","Boot","Bangkok","1","$1,996","$9,576","$80"
"Canada","Boot","Calgary","8","$17,720","$63,280","$472"
"Central America/Caribbean","Boot","Kingston","33","$102,372","$393,376","$4,454"
"Eastern Europe","Boot","Budapest","22","$74,102","$317,515","$3,341"
"Middle East","Boot","Al-Khobar","10","$15,062","$44,658","$765"
"Pacific","Boot","Auckland","12","$20,141","$97,919","$962"
"South America","Boot","Bogota","19","$15,312","$35,805","$1,229"
"United States","Boot","Chicago","16","$82,483","$305,061","$3,735"
"Western Europe","Boot","Copenhagen","2","$1,663","$4,657","$129"
• The Code:
proc import datafile="csv1.csv"
out=shoes dbms=csv replace;
getnames=no;
run;
Example from SAS Documentation
• Generated Code:
7 /**********************************************************************
8 * PRODUCT: SAS
9 * VERSION: 9.2
10 * CREATOR: External File Interface
11 * DATE: 10JUN14
12 * DESC: Generated SAS Datastep Code
13 * TEMPLATE SOURCE: (None Specified.)
14 ***********************************************************************/
15 data WORK.SHOES ;
16 %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
17 infile 'csv1.csv' delimiter = ',' MISSOVER DSD lrecl=32767 ;
18 informat VAR1 $34. ;
...
25 format VAR1 $34. ;
...
32 input
33 VAR1 $
...
40 ;
41 if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection
macro variable */
42 run;
Example from SAS Documentation
• Output:
Obs  VAR1                       VAR2  VAR3         VAR4  VAR5      VAR6      VAR7
1    Africa                     Boot  Addis Ababa  12    $29,761   $191,821  $769
2    Asia                       Boot  Bangkok      1     $1,996    $9,576    $80
3    Canada                     Boot  Calgary      8     $17,720   $63,280   $472
4    Central America/Caribbean  Boot  Kingston     33    $102,372  $393,376  $4,454
5    Eastern Europe             Boot  Budapest     22    $74,102   $317,515  $3,341
...
10   Western Europe             Boot  Copenhagen   2     $1,663    $4,657    $129
CSV Specification and Issues
• When CSV first came out, there was no real specification
• History traced to list directed/free form input specification in
FORTRAN 77
• First mention I could find (in SAS’ TS-673) was version 6
– Did not handle empty fields (two delimiters in a row)
– Did not handle quoted strings containing delimiter
• Each vendor had their own “standard” which may or may not
support quoted strings
• For a long time, Excel did not quote strings
• The creator of your file may not quote strings
• RFC 4180, IETF 2005, is the definitive reference now!
CSV Specification and Issues
• The format is basically:
– Optional Header (usually column names):
field_name,field_name,field_name
– Followed by Datalines:
aaa,bbb,ccc
zzz,yyy,xxx
– Optionally, strings are contained within single or double quotes
"aaa","bbb","ccc"
– If a quoted string contains the quote character, it must be escaped by doubling it:
• "Af"r,ica" becomes two fields: "Af"r and ica"
• "Af""r,ica" becomes a single field: Af"r,ica
– Each line should have a line terminator
• The specification really only matters to you if you have problems
importing a particular file.
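One quick way to spot the unescaped-quote case is to count the double quotes on each line: a well-formed single-line record has an even number of them. The following is a minimal sketch under that assumption (it will also flag legitimate fields that contain a line break); the function name `odd_quote_lines` and the sample data are made up for illustration.

```shell
# Hypothetical helper: print the line number of every record that holds an
# odd number of double quotes (assumes no field spans more than one line).
odd_quote_lines() {
    awk '{
        n = gsub(/"/, "&")          # gsub returns how many quotes it matched
        if (n % 2 == 1) print NR
    }' "$1"
}

# Illustrative data: line 2 has an unescaped quote, line 3 doubles it.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
"aaa","bbb","ccc"
"Af"r,ica","Boot"
"Af""r,ica","Boot"
EOF

odd_quote_lines "$sample"
```

Any line number it prints is worth opening in an editor before the file goes anywhere near PROC IMPORT.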
CSV Specification and Issues
• What happens if delimiter-containing fields are not quoted?
– Depends on consistency of data
– OK:
Applicant, Instructor
Resources, Human
Management, Fiscal
– Not OK:
Applicant, Instructor
Resources, Human
Training
• Inconsistent number of fields
• Column names exist in some files (but not all files)
• Page titles
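A rough way to detect an inconsistent number of fields is to tally how many lines have each field count; more than one distinct count is a warning sign. This is only a sketch: the function name `field_count_tally` is invented here, and it splits on every comma, so quoted fields that contain commas are over-counted.

```shell
# Hypothetical helper: tally how many lines have each field count.
# Splits on every comma; commas inside quoted strings are NOT recognised.
field_count_tally() {
    awk -F, '{ tally[NF]++ }
        END { for (n in tally) print n " fields: " tally[n] " line(s)" }' "$1" |
        sort -n
}

# Illustrative data taken from the "Not OK" case.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
Applicant, Instructor
Resources, Human
Training
EOF

field_count_tally "$sample"
```

A clean file produces a single output line; page titles and stray header rows usually show up as a count with only one or two lines against it.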
Fixing CSV Data Outside of SAS
• I strongly recommend TAB-delimited files over CSV
• Second choice is delimited by other character
(like ‘|’ pipe)
• Ask for a new version properly formatted
• Manually edit the file
• Use some UNIX/Linux and SAS tricks
Tips and Tricks – Modifying Generated Code
• My favorite trick is to let PROC IMPORT write the initial code for a file and
then copy/paste/edit the resulting DATA step into the program.
• Can alter variable names, formats, and data types
• Can use more INFILE options like DLMSTR, DLMSOPT, END, others
• Can perform data validation/correction
– Excel will let you enter ‘14-JUL’ as a date, treat it as July 14 2014, but
write it out to the file as ‘14-JUL’
– Humans are lazy.
– Your program could input the field and correct it as the data flows
through.
• Programmers are lazy. I don’t want to have to type all the variables in a
large delimited file.
• UNIX has a tool to cut columns out of a file:
– cut -c 10- csv3.log > data.txt
– Will eliminate row numbers and formatting in log columns 1-9
Tips and Tricks – Hanging PROC
• If you see this error when using PROC IMPORT or EXPORT while
running from the command line:
ERROR: Cannot open X display.
Check the name/server access authorization.
• Then the wizard is trying to run interactively. Your command line
should include:
sas -noterminal
• Supposed to be fixed in v9.3
• Another symptom is that the PROC “hangs”
Tips and Tricks – Delimited File Not Quote Escaped
• Problem reported recently by Don Stanley
bchr0001@GMAIL.COM
• This is the case where quotes are just data:
AAAA.BBBB.'I.CCCC'.0.DL
• PROC IMPORT/INFILE DLM DSD wants to treat 'I.CCCC' as
one field, developer wants it to be two.
• With the possibility of empty fields, partial DSD functionality
is needed. But DSD also invokes the quoted field handling.
Tips and Tricks – Delimited File Not Quote Escaped
• SAS Solution (I lost the contributor name)
infile custdata dlm=',' ;
input @ ;
do while(index(_infile_,',,')>0) ;
_infile_ = tranwrd(_infile_,',,',', ,') ;
end ; /* comma-comma to comma-space-comma */
input << delimited fields >> ;
– _infile_ is limited to 32,767 – could be problem for large records
• UNIX Shell Solution
awk '{ while (gsub(/,,/, ", ,")) ; print $0 }' INPUTFILE > OUTPUTFILE
– The substitution is repeated because a single pass leaves a ",," behind when three or more commas are adjacent
– Handles (essentially) unlimited record lengths
Tips and Tricks – File Contents
• UNIX head will show the first few rows
– Does the file have extra lines at the top (like a title line)?
• UNIX tail will show the last few rows
– Do I have all the data?
– Are the records consistent beginning to end?
• UNIX wc -l (word count, line) will show how many lines in total
– Is this file the right size?
• You can always edit the file (unless too large)
• UNIX od -c -x (octal dump, character and hex) is useful to see
format and contents of file (including unprintable characters).
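The checks above can be bundled into one small function so that every incoming file gets the same first look. This is a sketch only: `profile_file` is a made-up name, and the sample file (with a title line and Windows-style line endings) is invented to show what `od -c` reveals.

```shell
# Hypothetical helper: quick look at a delimited file before importing it.
profile_file() {
    echo "lines: $(wc -l < "$1" | tr -d ' ')"    # total number of lines
    echo "--- first 2 lines"
    head -n 2 "$1"
    echo "--- last 2 lines"
    tail -n 2 "$1"
    echo "--- first line, character by character"
    head -n 1 "$1" | od -c                       # exposes tabs, \r and the like
}

# Illustrative data: a title line, a header, one record, CR+LF line endings.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf 'Title line\r\ncountry,shoe\r\n"Africa","Boot"\r\n' > "$sample"

profile_file "$sample"
```

In the `od -c` section a carriage return appears as `\r` and a tab as `\t`, which is often the fastest way to learn what the delimiter and line terminator really are.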
Tips and Tricks – File Contents
• Determine maximum record length and field count (based on delimiters)
with a script
• This example is written in awk (new awk):
BEGIN{ FS=","; mNF=0; mRL=0; }
{
if (NF > mNF) mNF=NF;
RL=length($0);
if (RL > mRL) mRL=RL;
}
END{ print "max NF " mNF " max RECL " mRL;}
• Max NF is highest number of fields on any record (does not handle quote
escaping)
• Max RECL is the longest record length in the file
• Can be saved in a file and executed on your data:
awk -f csv_check.awk INPUTFILE.csv
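Because the script above counts every comma, quoted values such as "$29,761" inflate the field count. A possible variation walks each line one character at a time and ignores commas that fall between double quotes. It is a sketch with its own assumptions: the name `max_quoted_fields` is invented, only double quotes are treated as quote characters, and no field may contain a line break.

```shell
# Hypothetical helper: report the largest field count in a CSV file, treating
# commas inside double-quoted strings as data rather than as delimiters.
max_quoted_fields() {
    awk '{
        inq = 0; n = 1
        for (i = 1; i <= length($0); i++) {
            ch = substr($0, i, 1)
            if (ch == "\"") inq = !inq            # entering or leaving quotes
            else if (ch == "," && !inq) n++       # delimiter outside quotes
        }
        if (n > max) max = n
    }
    END { print "max NF " (max + 0) }' "$1"
}

# Illustrative data: one record from the shoes example.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
"Africa","Boot","Addis Ababa","12","$29,761","$191,821","$769"
EOF

max_quoted_fields "$sample"
```

A doubled quote inside a quoted string flips the flag twice, so the escaping rule from the CSV specification is handled without extra code.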
Tips and Tricks – Dropping Titles
• The input CSV:
This is a title line and should be here
country,shoe,city,count,average salary,max salary,average
"Africa","Boot","Addis Ababa","12","$29,761","$191,821","$769"
"Asia","Boot","Bangkok","1","$1,996","$9,576","$80"
"Canada","Boot","Calgary","8","$17,720","$63,280","$472"
• PROC IMPORT results:
Number of names found is less than number of variables found.
Name This is a title line and should be here truncated to
This_is_a_title_line_and_should.
Problems were detected with provided names. See LOG.
• How to get rid of the first line since DATAROW= won’t help:
– wc -l INPUT – returns value 5 (5 lines)
– tail -4 INPUT > OUTPUT – saves last 4 lines
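If counting lines first is inconvenient, `tail -n +K` starts its output at line K, so the title can be dropped without knowing the file length. A minimal sketch; `drop_leading_lines` and the sample file are hypothetical names used only for this illustration.

```shell
# Hypothetical helper: copy a file without its first N lines (default N = 1).
drop_leading_lines() {
    # tail -n +K begins output at line K, so skipping N lines means K = N + 1
    tail -n "+$(( ${2:-1} + 1 ))" "$1"
}

# Illustrative data: a title line ahead of the real header.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
This is a title line and should be here
country,shoe,city
"Africa","Boot","Addis Ababa"
EOF

drop_leading_lines "$sample"        # header and data only
```

Redirect the output to a new file and point PROC IMPORT at that file rather than the original.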
Tips and Tricks – Dropping Junk Records at End
• The input CSV:
country,shoe,city,count,average salary,max salary,average
"Africa","Boot","Addis Ababa","12","$29,761","$191,821","$769"
"Asia","Boot","Bangkok","1","$1,996","$9,576","$80"
"Canada","Boot","Calgary","8","$17,720","$63,280","$472"
,,,,,,
• Results in:
Obs  country  shoe  city         count  average_salary  max_salary  average
1    Africa   Boot  Addis Ababa  12     $29,761         $191,821    $769
2    Asia     Boot  Bangkok      1      $1,996          $9,576      $80
3    Canada   Boot  Calgary      8      $17,720         $63,280     $472
4
• How to get rid of the last line since DATAROW= won’t help:
– wc -l INPUT – returns value 5 (5 lines)
– head -4 INPUT > OUTPUT – saves first 4 lines
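When the junk records consist of nothing but delimiters, they can also be filtered by pattern instead of by position, which works wherever they appear in the file. A sketch under the assumption that the file uses plain newline terminators (a trailing carriage return would stop the pattern from matching); `drop_empty_records` is a made-up name.

```shell
# Hypothetical helper: remove records that contain only commas (or nothing).
drop_empty_records() {
    grep -v '^,*$' "$1"
}

# Illustrative data: a header, one record, and a trailing all-comma record.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
country,shoe,city,count,average salary,max salary,average
"Canada","Boot","Calgary","8","$17,720","$63,280","$472"
,,,,,,
EOF

drop_empty_records "$sample"
```

Unlike `head`, this does not need the line count, but it will also remove genuinely blank lines, so check that none of them are meaningful.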
Personal Note
• On a Personal Note
– I seem to learn a bunch when working on presentations,
new classes, and writings
– I was still treating PROC IMPORT as if I were using V6; I
didn’t realize it would handle quoted strings properly now.
– I have to learn some new habits and trust the tool more.
– In any commands within this presentation, the single and
double quotation marks should be simple, not the “smart
quotes” forced by Microsoft. The same applies to dashes
or minus signs – they should not be “em dashes” (- versus
–)
Wrap Up
Questions and Answers
Proc Import References
• NOTES
– Proc Import with a twist:
http://www2.sas.com/proceedings/sugi30/038-30.pdf
– Data Example:
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000314361.htm
– RFC 4180, IETF 2005, http://tools.ietf.org/html/rfc4180
– Wiki has good CSV background:
http://en.wikipedia.org/wiki/Comma-separated_values
– Hanging PROC IMPORT:
http://support.sas.com/kb/3/610.html
– CSV-1203 “Standard” – identifies some Excel “oddities”:
http://mastpoint.curzonnassau.com/csv-1203

More Related Content

Similar to 20171106 sesug bb 180 proc import ppt

Saurabh_Patel_An Alternative way to Import Multiple Excel files with Multiple...
Saurabh_Patel_An Alternative way to Import Multiple Excel files with Multiple...Saurabh_Patel_An Alternative way to Import Multiple Excel files with Multiple...
Saurabh_Patel_An Alternative way to Import Multiple Excel files with Multiple...
Saurabh Patel
 
Wei's Self Intro
Wei's Self IntroWei's Self Intro
Wei's Self Intro
sunmast
 

Similar to 20171106 sesug bb 180 proc import ppt (20)

Hadoop for the Absolute Beginner
Hadoop for the Absolute BeginnerHadoop for the Absolute Beginner
Hadoop for the Absolute Beginner
 
TYPO3 Transition Tool
TYPO3 Transition ToolTYPO3 Transition Tool
TYPO3 Transition Tool
 
C++ 11 Style : A Touch of Class
C++ 11 Style : A Touch of ClassC++ 11 Style : A Touch of Class
C++ 11 Style : A Touch of Class
 
01 isa
01 isa01 isa
01 isa
 
PHP - Introduction to Advanced SQL
PHP - Introduction to Advanced SQLPHP - Introduction to Advanced SQL
PHP - Introduction to Advanced SQL
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 
Vancouver AWS Meetup Slides 11-20-2018 Apache Spark with Amazon EMR
Vancouver AWS Meetup Slides 11-20-2018 Apache Spark with Amazon EMRVancouver AWS Meetup Slides 11-20-2018 Apache Spark with Amazon EMR
Vancouver AWS Meetup Slides 11-20-2018 Apache Spark with Amazon EMR
 
From a student to an apache committer practice of apache io tdb
From a student to an apache committer  practice of apache io tdbFrom a student to an apache committer  practice of apache io tdb
From a student to an apache committer practice of apache io tdb
 
Saurabh_Patel_An Alternative way to Import Multiple Excel files with Multiple...
Saurabh_Patel_An Alternative way to Import Multiple Excel files with Multiple...Saurabh_Patel_An Alternative way to Import Multiple Excel files with Multiple...
Saurabh_Patel_An Alternative way to Import Multiple Excel files with Multiple...
 
D Maeda Bi Portfolio
D Maeda Bi PortfolioD Maeda Bi Portfolio
D Maeda Bi Portfolio
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache Drill
 
Using existing language skillsets to create large-scale, cloud-based analytics
Using existing language skillsets to create large-scale, cloud-based analyticsUsing existing language skillsets to create large-scale, cloud-based analytics
Using existing language skillsets to create large-scale, cloud-based analytics
 
Master tuning
Master   tuningMaster   tuning
Master tuning
 
Wei's Self Intro
Wei's Self IntroWei's Self Intro
Wei's Self Intro
 
The Right Data for the Right Job
The Right Data for the Right JobThe Right Data for the Right Job
The Right Data for the Right Job
 
20180324 leveraging unix tools
20180324 leveraging unix tools20180324 leveraging unix tools
20180324 leveraging unix tools
 
Optimizing columnar stores
Optimizing columnar storesOptimizing columnar stores
Optimizing columnar stores
 
Optimizing columnar stores
Optimizing columnar storesOptimizing columnar stores
Optimizing columnar stores
 
02 isa
02 isa02 isa
02 isa
 
Uklug 2014 connections dev faq
Uklug 2014  connections dev faqUklug 2014  connections dev faq
Uklug 2014 connections dev faq
 

More from David Horvath

More from David Horvath (9)

20190413 zen and the art of programming
20190413 zen and the art of programming20190413 zen and the art of programming
20190413 zen and the art of programming
 
(SAS) UNIX X Command Tips and Tricks
(SAS) UNIX X Command Tips and Tricks(SAS) UNIX X Command Tips and Tricks
(SAS) UNIX X Command Tips and Tricks
 
20180414 nevil shute no highway modern metal fatigue
20180414 nevil shute no highway modern metal fatigue20180414 nevil shute no highway modern metal fatigue
20180414 nevil shute no highway modern metal fatigue
 
20180410 sasgf2018 2454 lazy programmers xml ppt
20180410 sasgf2018 2454 lazy programmers xml ppt20180410 sasgf2018 2454 lazy programmers xml ppt
20180410 sasgf2018 2454 lazy programmers xml ppt
 
20180324 zen and the art of programming
20180324 zen and the art of programming20180324 zen and the art of programming
20180324 zen and the art of programming
 
20171106 sesug bb 184 zen and the art of problem solving
20171106 sesug bb 184 zen and the art of problem solving20171106 sesug bb 184 zen and the art of problem solving
20171106 sesug bb 184 zen and the art of problem solving
 
20150904 "A Few Words About 'In The Wet' by Nevil Shute"
20150904 "A Few Words About 'In The Wet' by Nevil Shute"20150904 "A Few Words About 'In The Wet' by Nevil Shute"
20150904 "A Few Words About 'In The Wet' by Nevil Shute"
 
20150312 NOBS for Noobs
20150312 NOBS for Noobs20150312 NOBS for Noobs
20150312 NOBS for Noobs
 
20170419 To COMPRESS or Not, to COMPRESS or ZIP
20170419 To COMPRESS or Not, to COMPRESS or ZIP20170419 To COMPRESS or Not, to COMPRESS or ZIP
20170419 To COMPRESS or Not, to COMPRESS or ZIP
 

Recently uploaded

CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
gajnagarg
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
karishmasinghjnh
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
amitlee9823
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
amitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 

Recently uploaded (20)

CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 

20171106 sesug bb 180 proc import ppt

  • 1. BB-180-2017 PROC IMPORT and more Or When PROC IMPORT just doesn't do the job David B. Horvath, CCP, MS
  • 2. 2 PROC IMPORT and more The Author can be contacted at: 504 Longbotham Drive, Aston PA 19014-2502, USA Phone: 1-610-859-8826 Email: dhorvath@cobs.com Web: http://www.cobs.com/ All trademarks and servicemarks are the property of their respective owners. Copyright © 2017, David B. Horvath, CCP — All Rights Reserved
  • 3. 3 Introductions • My Background • PROC IMPORT Basics • CSV Specification and Issues • Fixing CSV Data Outside of SAS • Tips and Tricks
  • 4. Abstract • PROC IMPORT comes in handy when quickly trying to load a CSV or similar file. But it does have limitations. Unfortunately, I've run into those limitations and had to work around them. • This session will discuss the original CSV specification (early 1980's), how Microsoft Excel violates that specification, how SAS PROC IMPORT does not follow that specification, and the issues that can result. • Simple UNIX tools will be described that can be used to ensure that data hilarities do not occur due to CSV issues. • Recommendations will be made to get around some of PROC IMPORT limitations (like field naming, data type determination, limitation in number of fields, separator in data). • CSV, TAB, and DLM types will be discussed. 4
  • 5. 5 My Background • Base SAS on Mainframe, UNIX, and PC Platforms • SAS is primarily an ETL tool or Programming Language for me • My background is IT – I am not a modeler • Not my first User Group presentation – presented sessions and seminars in Australia, France, the US, Oxford England (about the British Author Nevil Shute), and Canada. • Undergraduate: Computer and Information Sciences, Temple Univ. • Graduate: Organizational Dynamics, UPenn • Most of my career was in consulting • Have written several books (none SAS-related, yet) • Adjunct Instructor covering IT topics.
  • 6. 6 PROC IMPORT Basics • Will focus on code rather than GUI Import Wizard • Handy tool to load data from other sources • You provide the input (flat) file, SAS creates the dataset • The primary formats: – CSV – Comma Separated Variables – TAB – Tab (ASCII ‘09x’ EBCDIC ’05’x) – DLM – You pick it • Quoted strings get special treatment any time you use delimited files – PROC IMPORT – INFILE with DLM and DSD (“Delimiter-Sensitive Data”)
  • 7. 7 PROC IMPORT Basics • Syntax: PROC IMPORT DATAFILE="filename” OUT=<libref.>SAS data-set <(SAS data-set-options)> <DBMS=identifier> <REPLACE> ; <data-source-statement(s);> • DBMS identifies data source type – CSV, TAB, and DLM will be discussed here – DBMS options could be an entire talk in themselves and largely depends on operating system and installed components. • REPLACE allows output dataset to be overwritten.
  • 8. 8 PROC IMPORT Basics • Data-source-statements: – DELIMITER only specified with DBMS=DLM • Will override DBMS=CSV • Can contain multiple values DELIMTER=‘!|’ – GETNAMES gets variable names from file • GETNAMES=YES is the default • Will “sasify” the names (underlines replace spaces, length, etc.) • Defaults DATAROW to 2 – DATAROW=n • Is similar in function to FIRSTOBS • Defaults to 1 unless GETNAMES=YES – GUESSINGROWS=20 is the default • Used by wizard to build format and input statements • Tradeoff between time and accuracy – higher count gives greater accuracy
  • 9. 9 Example from SAS Documentation • The input CSV: "Africa","Boot","Addis Ababa","12","$29,761","$191,821","$769" "Asia","Boot","Bangkok","1","$1,996","$9,576","$80" "Canada","Boot","Calgary","8","$17,720","$63,280","$472" "Central America/Caribbean","Boot","Kingston","33","$102,372","$393,376","$4,454" "Eastern Europe","Boot","Budapest","22","$74,102","$317,515","$3,341" "Middle East","Boot","Al-Khobar","10","$15,062","$44,658","$765" "Pacific","Boot","Auckland","12","$20,141","$97,919","$962" "South America","Boot","Bogota","19","$15,312","$35,805","$1,229" "United States","Boot","Chicago","16","$82,483","$305,061","$3,735" "Western Europe","Boot","Copenhagen","2","$1,663","$4,657","$129“ • The Code: proc import datafile=“csv1.csv" out=shoes dbms=csv replace; getnames=no; run;
  • 10. 10 Example from SAS Documentation • Generated Code (as it appears in the log, with log line numbers):
      7    /**********************************************************************
      8    * PRODUCT: SAS
      9    * VERSION: 9.2
      10   * CREATOR: External File Interface
      11   * DATE: 10JUN14
      12   * DESC: Generated SAS Datastep Code
      13   * TEMPLATE SOURCE: (None Specified.)
      14   ***********************************************************************/
      15   data WORK.SHOES ;
      16   %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
      17   infile 'csv1.csv' delimiter = ',' MISSOVER DSD lrecl=32767 ;
      18   informat VAR1 $34. ;
      ...
      25   format VAR1 $34. ;
      ...
      32   input
      33   VAR1 $
      ...
      40   ;
      41   if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
      42   run;
  • 11. 11 Example from SAS Documentation • Output:
      Obs  VAR1                       VAR2  VAR3         VAR4  VAR5      VAR6      VAR7
        1  Africa                     Boot  Addis Ababa  12    $29,761   $191,821  $769
        2  Asia                       Boot  Bangkok      1     $1,996    $9,576    $80
        3  Canada                     Boot  Calgary      8     $17,720   $63,280   $472
        4  Central America/Caribbean  Boot  Kingston     33    $102,372  $393,376  $4,454
        5  Eastern Europe             Boot  Budapest     22    $74,102   $317,515  $3,341
      ...
       10  Western Europe             Boot  Copenhagen   2     $1,663    $4,657    $129
  • 12. 12 CSV Specification and Issues • When CSV first came out, there was no real specification • Its history traces to the list-directed (free-form) input specification in FORTRAN 77 • The first mention I could find (in SAS' TS-673) was for version 6 – Did not handle empty fields (two delimiters in a row) – Did not handle quoted strings containing the delimiter • Each vendor had its own "standard", which may or may not support quoted strings • For a long time, Excel did not quote strings • The creator of your file may not quote strings • RFC 4180, IETF 2005, is the definitive reference now!
  • 13. 13 CSV Specification and Issues • The format is basically: – Optional header (usually column names): field_name,field_name,field_name – Followed by data lines, one record per line: aaa,bbb,ccc then zzz,yyy,xxx – Optionally, strings are contained within single or double quotes: "aaa","bbb","ccc" – If a quoted string contains the quote character, it must be escaped by doubling it: • "Af"r,ica" becomes two fields, "Af"r and ica" • "Af""r,ica" becomes the single field Af"r,ica – Each line should have a line terminator • The specification really only matters to you if you have problems importing a particular file.
  • 14. 14 CSV Specification and Issues • What happens if delimiter-containing fields are not quoted? – Depends on consistency of data – OK (every record splits the same way):
      Applicant, Instructor
      Resources, Human
      Management, Fiscal
    – Not OK (records split differently):
      Applicant, Instructor
      Resources, Human
      Training
    • Inconsistent number of fields • Column names exist in some files (but not all files) • Page titles
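One way to spot an inconsistent number of fields is to tally how many records have each field count. This is a minimal sketch, assuming a comma delimiter, no quoted fields, and a hypothetical file name (applicants.csv) holding the "Not OK" rows from the slide above:

```shell
# Build a small sample file (the "Not OK" case: the last record has no comma).
cat > applicants.csv <<'EOF'
Applicant, Instructor
Resources, Human
Training
EOF

# Tally records by field count. More than one output line means the
# records do not all split into the same number of fields.
field_counts=$(awk -F',' '{ n[NF]++ } END { for (k in n) print k " field(s): " n[k] " record(s)" }' applicants.csv | sort -n)
echo "$field_counts"
```

A single output line suggests the file is at least consistent; it does not prove the fields landed in the right columns.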
  • 15. 15 Fixing CSV Data Outside of SAS • I strongly advise against CSV in favor of TAB delimited • Second choice is delimited by other character (like ‘|’ pipe) • Ask for a new version properly formatted • Manually edit the file • Use some UNIX/Linux and SAS tricks
  • 16. 16 Tips and Tricks – Modifying Generated Code • My favorite trick is to let PROC IMPORT write the initial code for a file and then copy/paste/edit the resulting DATA step into the program. • Can alter variable names, formats, and data types • Can use more INFILE options like DLMSTR, DLMSOPT, END, and others • Can perform data validation/correction – Excel will let you enter '14-JUL' as a date, treat it as July 14 2014, but write it out to the file as '14-JUL' – Humans are lazy. – Your program could input the field and correct it as the data flows through. • Programmers are lazy. I don't want to have to type all the variables in a large delimited file. • UNIX has a tool to cut columns out of a file: – cut -c 10- csv3.log > data.txt – Will eliminate the row numbers and formatting in log columns 1-9
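A quick illustration of the cut step. The two log lines here are made up for the sketch, and they assume the log's line number and padding fill columns 1-9 so that the generated code starts in column 10; check your own log with head before relying on that column.

```shell
# Two made-up log lines: line number plus padding fills columns 1-9.
printf '%-9s%s\n' \
  '15' 'data WORK.SHOES ;' \
  '17' "infile 'csv1.csv' delimiter = ',' MISSOVER DSD lrecl=32767 ;" > csv3.log

# Keep everything from column 10 onward, dropping the log line numbers.
cut -c 10- csv3.log > data.txt
cat data.txt
```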
  • 17. 17 Tips and Tricks – Hanging PROC • If you see this error when using PROC IMPORT or EXPORT while running from the command line: ERROR: Cannot open X display. Check the name/server access authorization. • Then the wizard is trying to run interactively. Your command line should include: sas -noterminal • Supposed to be fixed in v9.3 • Another symptom is that the PROC "hangs"
  • 18. 18 Tips and Tricks – Delimited File Not Quote Escaped • Problem reported recently by Don Stanley bchr0001@GMAIL.COM • This is the case where quotes are just data: AAAA.BBBB.'I.CCCC'.0.DL • PROC IMPORT/INFILE DLM DSD wants to treat 'I.CCCC' as one field; the developer wants it to be two. • With the possibility of empty fields, partial DSD functionality is needed. But DSD also invokes the quoted field handling.
  • 19. 19 Tips and Tricks – Delimited File Not Quote Escaped • SAS Solution (I lost the contributor name):
      infile custdata dlm=',' ;
      input @ ;
      do while(index(_infile_,',,')>0) ;
        _infile_ = tranwrd(_infile_,',,',', ,') ;
      end ; /* comma-comma to comma-space-comma */
      input << delimited fields >> ;
    – _infile_ is limited to 32,767 characters – could be a problem for large records
    • UNIX Shell Solution (looping, like the SAS version, so that runs of three or more delimiters are fully expanded):
      awk '{ n = 1; while (n > 0) n = gsub(",,", ", ,"); print $0; }' INPUTFILE > OUTPUTFILE
    – Handles (essentially) unlimited record lengths
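A single gsub pass does not catch every pair when three or more delimiters are adjacent, because matches cannot overlap. The sketch below repeats the substitution until no pair remains; the record and file names are made up for illustration.

```shell
# Made-up record with three empty fields in a row (four adjacent commas).
printf 'AAAA,,,,BBBB\n' > empties.csv

# gsub returns the number of replacements; loop until it reports none.
awk '{ n = 1; while (n > 0) n = gsub(/,,/, ", ,"); print }' empties.csv > fixed.csv
cat fixed.csv
```

The result still has five comma-separated fields, but every empty field now holds a single space, so it can be read without DSD.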
  • 20. 20 Tips and Tricks – File Contents • UNIX head will show the first few rows – Does the file have extra lines at the top (like a title line)? • UNIX tail will show the last few rows – Do I have all the data? – Are the records consistent beginning to end? • UNIX wc -l (word count, lines) will show how many lines there are in total – Is this file the right size? • You can always edit the file (unless it is too large) • UNIX od -c -x (octal dump, with character and hex output) is useful to see the format and contents of the file (including unprintable characters).
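The checks above might look like this on a small file. The file name and contents are hypothetical: a title line, a header, and two records, one of which hides a tab character inside a value.

```shell
# Made-up file: title line, header, two records, and a stray tab in the data.
printf 'Report title\ncountry,city\n"Africa","Addis\tAbaba"\n"Asia","Bangkok"\n' > inspect.csv

head -n 2 inspect.csv              # first rows: is there a title line before the header?
tail -n 2 inspect.csv              # last rows: does the file end with real data?
line_count=$(wc -l < inspect.csv)  # total number of lines
echo "lines: $line_count"
od -c inspect.csv | head -n 5      # character dump: the tab shows up as \t
```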
  • 21. 21 Tips and Tricks – File Contents • Determine maximum record length and field count (based on delimiters) with a script • This example is written in awk (new awk):
      BEGIN { FS=","; mNF=0; mRL=0; }
      {
        if (NF > mNF) mNF=NF;
        RL=length($0);
        if (RL > mRL) mRL=RL;
      }
      END { print "max NF " mNF " max RECL " mRL; }
    • Max NF is the highest number of fields on any record (does not handle quote escaping) • Max RECL is the longest record length in the file • Can be saved in a file and executed on your data: awk -f csv_check.awk INPUTFILE.csv
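To illustrate the quote-escaping caveat: splitting on every comma also counts the commas inside quoted values. A small check, using one record from the documentation example and a hypothetical file name:

```shell
# One record from the documentation example: 7 quoted fields, two of
# which ("$29,761" and "$191,821") contain a comma inside the quotes.
printf '%s\n' '"Africa","Boot","Addis Ababa","12","$29,761","$191,821","$769"' > one_row.csv

# A plain comma split counts those embedded commas as delimiters,
# so it reports 9 fields rather than 7.
naive_nf=$(awk -F',' '{ print NF }' one_row.csv)
echo "naive field count: $naive_nf"
```

So a max NF that is higher than expected may point to embedded delimiters rather than extra columns.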
  • 22. 22 Tips and Tricks – Dropping Titles • The input CSV:
      This is a title line and should be here
      country,shoe,city,count,average salary,max salary,average
      "Africa","Boot","Addis Ababa","12","$29,761","$191,821","$769"
      "Asia","Boot","Bangkok","1","$1,996","$9,576","$80"
      "Canada","Boot","Calgary","8","$17,720","$63,280","$472"
    • PROC IMPORT results:
      Number of names found is less than number of variables found.
      Name This is a title line and should be here truncated to This_is_a_title_line_and_should.
      Problems were detected with provided names. See LOG.
    • How to get rid of the first line since DATAROW= won't help: – wc -l INPUT – returns value 5 (5 lines) – tail -4 INPUT > OUTPUT – saves last 4 lines
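If the number of lines is not known in advance, tail can instead start its output at a given line. A sketch, assuming exactly one title line and hypothetical file names:

```shell
cat > titled.csv <<'EOF'
This is a title line and should be here
country,shoe,city,count,average salary,max salary,average
"Africa","Boot","Addis Ababa","12","$29,761","$191,821","$769"
"Asia","Boot","Bangkok","1","$1,996","$9,576","$80"
"Canada","Boot","Calgary","8","$17,720","$63,280","$472"
EOF

# Start output at line 2, i.e. skip the single title line.
tail -n +2 titled.csv > untitled.csv
head -n 1 untitled.csv
```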
  • 23. 23 Tips and Tricks – Dropping Junk Records at End • The input CSV:
      country,shoe,city,count,average salary,max salary,average
      "Africa","Boot","Addis Ababa","12","$29,761","$191,821","$769"
      "Asia","Boot","Bangkok","1","$1,996","$9,576","$80"
      "Canada","Boot","Calgary","8","$17,720","$63,280","$472"
      ,,,,,,
    • Results in:
                                              average_
      Obs  country  shoe  city         count  salary   max_salary  average
        1  Africa   Boot  Addis Ababa  12     $29,761  $191,821    $769
        2  Asia     Boot  Bangkok      1      $1,996   $9,576      $80
        3  Canada   Boot  Calgary      8      $17,720  $63,280     $472
        4
    • How to get rid of the last line since DATAROW= won't help: – wc -l INPUT – returns value 5 (5 lines) – head -4 INPUT > OUTPUT – saves first 4 lines
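As an alternative to counting lines first, sed can delete the last line directly. A sketch, assuming exactly one junk record at the end and hypothetical file names:

```shell
cat > trailing.csv <<'EOF'
country,shoe,city,count,average salary,max salary,average
"Africa","Boot","Addis Ababa","12","$29,761","$191,821","$769"
"Asia","Boot","Bangkok","1","$1,996","$9,576","$80"
"Canada","Boot","Calgary","8","$17,720","$63,280","$472"
,,,,,,
EOF

# The sed address $ selects the last line; d deletes it.
sed '$d' trailing.csv > trimmed.csv
tail -n 1 trimmed.csv
```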
  • 24. 24 Personal Note • On a Personal Note – I seem to learn a bunch when working on presentations, new classes, and writings – I was still treating PROC IMPORT like I was still using V6; I didn't realize it would handle quoted strings properly now. – I have to learn some new habits and trust the tool more. – In any commands within this presentation, the single and double quotation marks should be simple, not the “smart quotes” forced by Microsoft. The same applies to dashes or minus signs – they should not be “em dashes” (- versus –)
  • 26. 26 Proc Import References • NOTES – Proc Import with a twist: http://www2.sas.com/proceedings/sugi30/038-30.pdf – Data Example: http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000314361.htm – RFC 4180, IETF 2005: http://tools.ietf.org/html/rfc4180 – Wiki has good CSV background: http://en.wikipedia.org/wiki/Comma-separated_values – Hanging PROC IMPORT: http://support.sas.com/kb/3/610.html – CSV-1203 "Standard" – identifies some Excel "oddities": http://mastpoint.curzonnassau.com/csv-1203