Using SAS to Bridge the Gap between Computer Science and Quantitative Methods
David R. Dolkart, American Hospital Association
Introduction
The degree of specialization within the fields of computer science and quantitative methods creates a gap in analyses. Computer scientists' concentration on efficiency of design and detailed documentation often becomes a roadblock to quantitative analysis. On the other hand, researchers' orientation toward quick results leads to inefficient program designs and scarce documentation of data sets and program operations. To avoid the extremes found in this interdisciplinary gap, a balance between efficiency, documentation, and turnaround time is crucial. This balance rests heavily on five factors: the underlying analysis characteristics; the nature of the project deadline; computer resource needs; staff requirements; and the probability of repeating an analysis. Failure to fully assess the weight of each of these factors is a shortcut to detours and dead ends.
Underlying Characteristics
The data processing segment of a research project involves one or more of the following operations:
1. Data Edits
2. Data Manipulation
3. Report Generation
4. Statistical Calculation
5. Graphing
Generally, a programming language (FORTRAN, COBOL, or PL/I, for example) or a package like SAS handles projects involving the first three operations. However, when edit, manipulation, or report operations dominate a project, the balance sways toward the efficiency found in a programming language. In projects where these five operations are equally important, the balance rests on a combination of a programming language and SAS. With few exceptions, projects involving extensive statistical calculations and graphing require the use of SAS alone.
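As a rough illustration of how SAS covers the first three operations in a handful of statements, the sketch below edits a claims file, derives a variable, and prints a short report. The data set and variable names (claims_raw, charge, procedure) are hypothetical stand-ins, not taken from any project described in this paper.

/* Hypothetical sketch: data edit, data manipulation, and report generation in SAS. */
data claims_clean;
  set claims_raw;                    /* assumed input data set                 */
  if charge <= 0 then delete;        /* data edit: drop invalid charge records */
  log_charge = log(charge);          /* data manipulation: derived variable    */
run;

proc print data=claims_clean (obs=20);
  var procedure charge log_charge;   /* report generation: quick listing       */
run;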
Defining the Project Deadline
When external forces determine a project deadline (for example, a corporate officer requests an analysis to support a critical decision), sketching the detail of the quantitative analysis and data processing steps is essential. Under this time pressure, minimizing design and documentation paperwork frees up valuable time for quality control. When deadlines are flexible, there is more time to spend on documentation. The documentation level cannot be allowed to dominate and, therefore, roadblock the analysis. But, by the same token, the documentation cannot be so scarce as to leave no trace of the data processing mechanics. Deadline flexibility also allows more room to maximize computer efficiency, that is, to minimize storage, memory, and run-time requirements.
Computer Resource Needs
The programming language and statistical package mixture depends on the computer resource needs of an application. These needs include disk storage, computer memory, and run-time. The resource needs of large data sets [1] point the mixture toward the efficiency found in programming languages. In other applications, the amount of edits, manipulation, and computation is the dominant factor in deciding resource requirements and, therefore, in the programming language-statistical package mixture decision.
The Role of Staff Requirements
In some research analyses, the size and expertise of the project staff predetermine the mix. For example, suppose one person comprises the staff. The programming background of that person determines whether SAS, FORTRAN, PL/I, or COBOL are elements of the software mixture. Flexibility in staff size and expertise, on the other hand, provides more opportunities to design and implement an efficient processing combination.
What is the Probability of Repeating an Analysis?
The ad hoc nature of some research applications implies that the probability of repeating an analysis is near zero. Therefore, the development costs necessary to minimize run-time may outweigh total cost savings, regardless of the data set size. As the probability of repetition increases, the cost savings from minimizing run-times begin to offset development costs. In some instances, time constraints indicate the use of a statistical package alone, even though the probability of repetition is high. After meeting the initial deadline, more time exists for using a programming language to optimize efficiency.
Striving for the Optimal Mix
Examining the five processing attributes of quantitative analyses provides the means to strive for an optimal programming language and statistical package equation. The emphasis here is on "strive for." Given the deadline and staff constraints found in many research projects, it is rare to "calculate" this optimal equation. Since precise calculation is unrealistic, researchers need a practical method to analyze their data processing requirements. The table below outlines such a method by defining general attributes which yield three processing mixtures. In the extreme situations, case 1 and case 3, mixture decisions are rarely unclear. For example, the computation of monthly price indexes requires the efficiency of a programming language, whereas the exploratory nature of model fitting requires the flexibility of a statistical package. The intermediate situation described by case 2 includes difficult applications which stretch into case 1 or 3. For example, selecting 1.5 million records from 5 million records for statistical analysis falls in the fringe, which indicates the use of a programming language alone (case 1) or in combination with SAS (case 2).
General Guidelines for Choosing between a Programming Language and SAS, or Some Combination of the Two

Attributes                                    Case 1            Case 2             Case 3
1. Underlying Characteristics
   a. Data Editing                            Heavy             Moderate           Some
   b. Data Manipulation                       Heavy             Moderate           Some
   c. Report Generation                       Extensive         Minor              Minor
   d. Statistical Computation                 No                Yes                Extensive
   e. Graphing                                No                Possibly           Extensive
2. Availability of Pre-written Programs       Not Available     Some Available     Not Available
3. Staff Mix
   a. Number                                  At least one      One or two         One
   b. Programming Language Experience         Yes (e.g., PL/I)  Minor              None
   c. SAS Experience                          No                Yes                Extensive
4. Project Deadline                           Open              5-10 Working Days  Less Than 5 Days
5. Computer Resource Needs
   a. Disk Storage                            Extensive         Moderate           Moderate
   b. Computer Memory                         Extensive         Moderate           Moderate
   c. Run-Time                                Over 5 Minutes    3-7 Minutes        5 Minutes At Most
6. Probability of Repeating Application
   on an Ongoing Basis                        Greater Than 50%  5-10%              0-5%

Decision
   • Choose a Programming Language            X
   • Choose a Programming Language and SAS                      X
   • Choose SAS                                                                    X
Review the Data Processing Steps
After formulating the software mixture, it is crucial to: review programs for efficiency; estimate the required computer resources; and verify data edits, manipulations, and formulas with a test on a small subset (for example, 20-100 observations) of the data. A careful review of programs for efficiency double-checks whether computer resource use is at a reasonable level. Testing all program operations on a small subset of data is an essential quality control measure. Tests also minimize costly reruns by providing valuable statistics for estimating storage, memory, and run-time requirements for the full data set.
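In SAS, one convenient way to carry out such a subset test is the OBS= data set option, which limits how many observations a step reads. The sketch below is a hypothetical illustration; the data set name full_claims and the edit and formula shown are assumptions, not part of the original project.

/* Hypothetical sketch: exercise edits and formulas on the first 100 observations. */
data test_subset;
  set full_claims (obs=100);         /* read only a small slice of the full file */
  if charge <= 0 then delete;        /* sample data edit to be verified          */
  adj_charge = charge * 1.05;        /* sample formula to be verified            */
run;

proc print data=test_subset;
run;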
A Mixture for 4.4 Million Records
Analyzing physicians' charge patterns for twelve medical procedures becomes tricky when there are 4.4 million records in the data set [2]. Even though these observations comprise 1.5 million records (33.6 percent of the data), the twelve procedures appear "randomly" throughout the 4.4 million insurance claims [3]. That is,
sorting the data does not eliminate the need to read
each record.
With no assistance from utilities or existing programs, but with sufficient time and staff, a programming language provides the best way to pull out the physicians who performed any of these twelve procedures. With its data manipulation and statistical computation capabilities, SAS provides the means to achieve flexibility in this analysis. Therefore, a combination of a programming language and SAS formulates an optimal solution for this analysis.
With sufficient lead time, a programmer wrote a PL/I program which processes over 29,000 records per second [4]. At this rate, 1.1 million Medicare records and 0.4 million Blue Shield private business records came out of the program in slightly over 2.5 minutes.
Run-time statistics show that an IBM sort utility arranged the raw (that is, not in SAS data set form) Medicare records at a rate of 15,000 per second. In comparison, SAS's PROC SORT arranged the private business records (that is, in SAS data set form) at a rate just exceeding 6,500 per second. This rate differential clearly indicates the run-time savings available from a careful outline which identified the data processing steps.
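For the SAS side of that comparison, the sort of the private business records would have looked roughly like the sketch below; the data set and BY variable names (private_claims, physician, procedure, location) are assumed for illustration rather than taken from the original program.

/* Hypothetical sketch: PROC SORT on the Blue Shield private business records, */
/* already stored as a SAS data set, ordered for later BY-group processing.    */
proc sort data=private_claims out=private_sorted;
  by physician procedure location;
run;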
Medicare data had different codes for comparable private business procedures. Here, the outline revealed another significant savings. It was more efficient to transform the Medicare codes (to match the private business codes) after calculating means for each physician-procedure-geographic location combination (that is, PROC MEANS; BY PHYSICIAN PROCEDURE LOCATION;). That is, it is cheaper to recode 16.5 thousand summarized Medicare observations than 1.1 million raw claim records. The Medicare records were then merged with 13.5 thousand Blue Shield records by physician, procedure code, and location.
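A rough SAS sketch of that summarize-recode-merge sequence appears below. The data set names, the charge variable, and the specific code translation are hypothetical; only the BY-group summary with PROC MEANS and the MERGE BY step reflect the approach described above.

/* Hypothetical sketch: summarize Medicare charges per physician-procedure-     */
/* location, recode the summarized procedure codes, then merge with Blue Shield */
/* means. Assumes both inputs are already sorted by the BY variables.           */
proc means data=medicare_sorted noprint;
  by physician procedure location;
  var charge;
  output out=medicare_means mean=mean_charge;
run;

data medicare_recode;
  set medicare_means;
  if procedure = '0123' then procedure = '4567';  /* assumed code translation */
run;

proc sort data=medicare_recode;                    /* restore BY order after recoding */
  by physician procedure location;
run;

data combined;
  merge medicare_recode blueshield_means;
  by physician procedure location;
run;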
Careful planning, the right combination of programming skills, and sufficient lead time bridged the interdisciplinary gap by reducing 4.4 million claims to 30,000 physician-specific observations with minimized run-times and maximized analysis flexibility.
Summary
A balance between efficient programs, documentation, and quick results leads to an optimal mix of a programming language and a statistical package. Formulating the ingredients in this software mixture rests on a careful analysis of five factors:
1. the underlying analysis characteristics;
2. the nature of the project deadline;
3. computer resource requirements;
4. staff requirements;
5. the probability of repeating the analysis.
Furthermore, a careful review of a project's purpose and this data processing mix provide the road map to avoid errors and expensive program reruns. Ultimately, this optimal programming language-statistical package formula provides the vehicle to bridge the gap between computer science and quantitative methods.
[1] Data set size is a relative concept which depends on computer capacity. To any computer running at peak capacity, one million records comprise a large data set. As the load on this computer diminishes, these records shrink toward a medium-size data set.
[2] The cost of collecting the data used in this paper was funded by Blue Cross and Blue Shield Associations and the Health Care Financing Administration under Contract 500-78-001.
[3] For the purpose of this paper, records and claims are synonymous.
[4] Note that this processing rate applies to an AMDAHL 470/V8 configuration which processes over six million instructions per second.