Using SAS to Bridge the Gap between Computer Science and Quantitative Methods discusses how to balance efficient programming, documentation, and quick results. The optimal mix depends on five key factors: 1) the underlying analysis characteristics, 2) project deadlines, 3) computer resource needs, 4) staff requirements, and 5) the probability of repeating the analysis. A case study describes analyzing 4.4 million medical records using a combination of PL/I for efficient data manipulation and SAS for flexibility. Careful planning of the software mixture reduced the 4.4 million claims to about 30,000 physician-specific observations, with the PL/I extraction step processing over 29,000 records per second and finishing in slightly over 2.5 minutes.
Survey of streaming data warehouse update scheduling (eSAT Journals)
This paper surveys the problem of scheduling update jobs in streaming data warehouses, which combine traditional data warehouses with data stream systems. Update jobs are the processes that load newly arrived data into tables, and the goal of scheduling them is to minimize data staleness, the scheduling metric considered here. The surveyed approaches also handle the challenges streaming warehouses face, such as data consistency, view hierarchies, heterogeneity among update jobs caused by differing arrival times and data volumes, and the preemption of updates.
Keywords: partitioning strategy, scalable scheduling, data stream management system.
SURVEY ON SCHEDULING AND ALLOCATION IN HIGH LEVEL SYNTHESIS (cscpconf)
This paper presents a detailed survey of the scheduling and allocation techniques for High Level Synthesis (HLS) reported in the research literature, along with methodologies and techniques for improving speed, (silicon) area, and power in HLS.
REAL-TIME CHANGE DATA CAPTURE USING STAGING TABLES AND DELTA VIEW GENERATION... (ijiert bestjournal)
In the big data era, data has become more important for Business Intelligence and Enterprise Data Analytics systems. The load cycle of a traditional data warehouse is fixed and long, so it cannot respond in time to rapid, real-time data change. Real-time data warehouse technology, an extension of the traditional data warehouse, can capture rapid data changes and perform real-time analysis to meet these requirements. Providing real-time data access without processing delay is a challenging task for a real-time data warehouse. This paper discusses current change data capture (CDC) technologies and explains why they are unable to deliver changes in real time. It also describes approaches for dimension delta view generation, incremental loading of real-time data, and a staging table ETL framework that processes historical and real-time data separately. Incremental load is the preferred approach in efficient ETL processes: the delta view and staging table framework for a dimension encompasses all its source tables and produces the set of keys that should be incrementally processed. We have employed this approach in a real-world project and observed effective real-time ETL and a reduction in the loading time of large dimensions.
Enhancement techniques for data warehouse staging area (IJDKP)
Poor performance can turn a successful data warehousing project into a failure. Consequently, several
attempts have been made by various researchers to deal with the problem of scheduling the Extract-
Transform-Load (ETL) process. In this paper we therefore present several approaches in the context of
enhancing the data warehousing Extract, Transform and loading stages. We focus on enhancing the
performance of extract and transform phases by proposing two algorithms that reduce the time needed in
each phase through employing the hidden semantic information in the data. Using the semantic
information, a large volume of useless data can be pruned in early design stage. We also focus on the
problem of scheduling the execution of the ETL activities, with the goal of minimizing ETL execution time.
We explore and invest in this area by choosing three scheduling techniques for ETL. Finally, we
experimentally show their behavior in terms of execution time in the sales domain to understand the impact
of implementing any of them and choosing the one leading to maximum performance enhancement.
In the present paper, the applicability and capability of AI techniques for effort estimation prediction have been
investigated. Neuro-fuzzy models are very robust, characterized by fast computation, and capable of handling
distorted data. Because of the non-linearity present in the data, they are an efficient quantitative tool for
predicting effort estimation. A one-hidden-layer network, named OHLANFIS, has been developed in the MATLAB
simulation environment.
The initial parameters of the OHLANFIS are identified using the subtractive clustering method, and the
parameters of the Gaussian membership functions are optimally determined using the hybrid learning algorithm.
The analysis shows that the effort estimation prediction model developed with the OHLANFIS technique performs
well compared with a standard ANFIS model.
Power Management in Micro grid Using Hybrid Energy Storage System (ijcnes)
This paper proposes power management in a microgrid using a hybrid distributed generator based on photovoltaic, wind-driven PMDC, and energy storage systems. In this generator, the sources are connected together to the grid through an interleaved boost converter followed by an inverter, so the proposed scheme uses fewer power converters than earlier schemes. Fuzzy-based MPPT controllers are also proposed for the new hybrid scheme to separately trigger the interleaved DC-DC converter and the inverter, tracking the maximum power from both sources. The integrated operation of both proposed controllers under different conditions is demonstrated through simulation in MATLAB.
Journal for Clinical Studies: Examination of Roles in Data Management in Clin... (KCR)
With the development, implementation and gradual evolution of IT systems, the clinical research industry has undergone years of ever-narrowing specialization. Kaia Koppel, Senior Clinical Data Manager at KCR, and Martin Noor, Clinical Data Manager at KCR, discuss how changes in the digital environment also changed classical ‘clinical data management’ activities as they became more and more prevalent across all operational levels of the industry. Part 2 of the article, on resource organization as a key to achieving efficiency, was published in the August issue of the Journal for Clinical Studies (p. 18-20).
Elimination of data redundancy before persisting into DBMS using SVM classification (nalini manogaran)
Database Management Systems (DBMS) are one of the growing fields in the computing world. Grid computing,
internet sharing, distributed computing, parallel processing, and cloud systems all store huge amounts of data in
a DBMS to maintain the structure of the data. Memory management is a major concern in a DBMS because of the
edit, delete, recover, and commit operations applied to records. To use memory efficiently, redundant data should
be eliminated accurately. In this paper, redundant data is detected with the Quick Search Bad Character (QSBC)
function and reported to the database administrator for removal. The QSBC function compares the data against
patterns taken from an index table built over all data persisted in the DBMS, which simplifies the comparison of
redundant (duplicate) records in the database. The experiment is carried out in SQL Server on a university student
database containing 15,000 student records covering various activities, and performance is evaluated in terms of
time and accuracy.
Keywords: Data redundancy, Database Management System, Support Vector Machine, Data Duplicate.
I. INTRODUCTION
The growing mass of information present in digital media has become a pressing problem for data
administrators. Data repositories such as those used by digital libraries and e-commerce systems are usually
built from data gathered from distinct sources, with disparate schemata and structures. Problems such as slow
response times, availability, security, and quality assurance also become harder to manage as the amount of
data grows. It is reasonable to assume that the quality of the data an organization uses in its systems is tied to
its effectiveness in offering useful services to its users. In this environment, the decision to maintain
repositories containing "dirty" data (i.e., with replicas, identification errors, duplicate patterns, etc.) goes well
beyond technical considerations such as the overall speed or performance of data administration systems.
Nalini.M (nalini.tptwin@gmail.com), Anbu.S
AN INTEGER-LINEAR ALGORITHM FOR OPTIMIZING ENERGY EFFICIENCY IN DATA CENTERS (ijfcstjournal)
Nowadays, to meet enormous computational demands, energy consumption, the largest part of which is related
to idle resources, has grown into a large share of a data center's budget, so minimizing energy consumption is
one of the most important issues in green computing. In this paper, we present a mathematical model,
formulated as an integer-linear program, that minimizes energy consumption and maximizes user satisfaction
simultaneously. The migration variables, the principal decision variables of the model, can be relaxed to
continuous values in some practical problems. This relaxation helps a decision maker find solutions faster that
are usually good approximations of the optimum. Near-feasible solutions (infeasible solutions that are desirably
close to the feasible region) are investigated as another relaxation. For this purpose, we first present a measure
of the infeasibility of a solution and then, when necessary, let the model consider an extended region that
includes solutions with permissible infeasibility.
USING FACTORY DESIGN PATTERNS IN MAP REDUCE DESIGN FOR BIG DATA ANALYTICS (HCL Technologies)
Although insights from Big Data enable better business decisions, Big Data poses its own set of challenges. This paper addresses the Variety problem and suggests a way to handle data processing seamlessly even when the data type or processing algorithm changes. It explores various MapReduce design patterns and arrives at a unified working solution (a library). The library can 'adapt' itself to any data processing need achievable with MapReduce, saving many man-hours and enforcing good coding practices.
Proceedings of the 2015 Industrial and Systems Engineering Res.docx (wkyra78)
Proceedings of the 2015 Industrial and Systems Engineering Research Conference
S. Cetinkaya and J. K. Ryan, eds.
Use of Symbolic Regression for Lean Six Sigma Projects
Daniel Moreno-Sanchez, MSc.
Jacobo Tijerina-Aguilera, MSc.
Universidad de Monterrey
San Pedro Garza Garcia, NL 66238, Mexico
Arlethe Yari Aguilar-Villarreal, MEng.
Universidad Autonoma de Nuevo Leon
San Nicolas de los Garza, NL 66451, Mexico
Abstract
Lean Six Sigma projects and the quality engineering profession have to deal with an extensive selection of tools,
most of them requiring specialized training. The increased availability of standard statistical software motivates the
use of advanced data science techniques to identify relationships between potential causes and project metrics. In
these circumstances, Symbolic Regression has received increased attention from researchers and practitioners to
uncover the intrinsic relationships hidden within complex data without requiring specialized training for its
implementation. The objective of this paper is to evaluate the advantages and drawbacks of using computer assisted
Symbolic Regression within the Analyze phase of a Lean Six Sigma project. An application of this approach in a
service industry project is also presented.
Keywords
Symbolic Regression, Data Science, Lean Six Sigma
1. Introduction
Lean Six Sigma (LSS) has become a well-known hybrid methodology for quality and productivity improvement in
organizations. Its wide adoption in several industries has shaped Process Innovation and Operational Excellence
initiatives, enabling LSS to become a main topic in quality practitioner sites of interest [1], recognized Six Sigma
(SS) certification body of knowledge contents [2], and professional society conferences [3].
However, LSS projects and the quality engineering profession have to deal with an extensive selection of tools, most
of them requiring specialized training. To assist LSS practitioners it is common to categorize tools based on the
traditional DMAIC model which stands for Define, Measure, Analyze, Improve, and Control phases. Table 1
presents an overview of the main tools that are commonly used in each phase of a LSS project, allowing team
members to progressively develop an understanding between realizing each phase’s intent and how the selected
tools can contribute to that purpose.
This paper focuses on the Analyze phase where tools for statistical model building are most likely to be selected.
The increased availability of standard statistical software motivates the use of advanced data science techniques to
identify relationships between potential causes and project metrics. In these circumstances Symbolic Regression
(SR) has received increased attention from researchers and practitioners even though SR is still in an early stage of
commercial availability.
The objective of this paper is to evaluate the advantages and drawbacks o ...
Interpreting available data is a focal issue in data mining. Gathering of primary data is a difficult and expensive affair for assessing the trends for any business decision especially when multiple players are present. There is no uniform formula-type work procedure to deduce information from a vast data set especially if the data formats in the secondary sources are not uniform and need enormous cleansing to
mend the data for statistical analysis. In this paper, an incremental approach to cleanse data using a simple yet extended procedure is presented and it is shown how to deduce conclusions to facilitate business decisions. Freely available Indian Telecom Industry’s data over a year is used to illustrate this process. It is shown how to conclude the superiority of one telecom service provider over the others comparing different parameters like network availability, customer service quality etc. using a relative parameter
quantification technique. It is found that this method is computationally less costly than the other known
methods.
A HYPER-HEURISTIC METHOD FOR SCHEDULING THE JOBS IN CLOUD ENVIRONMENT (ieijjournal1)
Cloud computing has become a promising technology and a key to providing flexible, service-oriented, online
provisioning and storage of computing resources and user information at lower expense, within a dynamic,
pay-per-use framework. In this technology, the job scheduling problem is a critical issue: for well-organized
management and handling of resources, scheduling plays a vital role. This paper presents an improved
Hyper-Heuristic Scheduling Approach that schedules resources by taking account of computation time and
makespan, with two detection operators used to select the low-level heuristics automatically. A Conditional
Revealing Algorithm (CRA) is applied to detect job failures while allocating the resources. We believe the
proposed hyper-heuristic achieves better results than other individual heuristics.
SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ... (ijdpsjournal)
The paper aims at proposing a solution for designing and developing a seamless automation and
integration of machine learning capabilities for Big Data with the following requirements: 1) the ability to
seamlessly handle and scale very large amount of unstructured and structured data from diversified and
heterogeneous sources; 2) the ability to systematically determine the steps and procedures needed for
analyzing Big Data datasets based on data characteristics, domain expert inputs, and data pre-processing
component; 3) the ability to automatically select the most appropriate libraries and tools to compute and
accelerate the machine learning computations; and 4) the ability to perform Big Data analytics with high
learning performance, but with minimal human intervention and supervision. The whole focus is to provide
a seamless automated and integrated solution which can be effectively used to analyze Big Data with high-frequency
and high-dimensional features from different types of data characteristics and different
application problem domains, with high accuracy, robustness, and scalability. This paper highlights the
research methodologies and research activities that we propose to be conducted by the Big Data
researchers and practitioners in order to develop and support seamless automation and integration of
machine learning capabilities for Big Data analytics.
Sugi-82-26 Dolkart
Using SAS to Bridge the Gap between Computer Science and Quantitative Methods
David R. Dolkart, American Hospital Association
Introduction
The degree of specialization within the fields of computer science and quantitative methods presents a gap in
analyses. The concentration of computer scientists on efficiency of design and detailed documentation
oftentimes roadblocks quantitative analysis. On the other hand, researchers' orientation toward quick results
leads to inefficient program designs and scarce documentation of data sets and program operations. To avoid
the extremes found in this interdisciplinary gap, a balance between efficiency, documentation, and turnaround
time is crucial. This balance rests heavily on five factors: the underlying analysis characteristics; the nature of
the project deadline; computer resource needs; staff requirements; and the probability of repeating an analysis.
Failure to fully assess the weight of each of these factors is a short cut to detours and dead ends.
Underlying Characteristics
The data processing segment of a research project involves one or more of the following operations:

1. Data Edits
2. Data Manipulation
3. Report Generation
4. Statistical Calculation
5. Graphing

Generally, a programming language (FORTRAN, COBOL, or PL/I, for example) or a package like SAS handles
projects involving the first three operations. However, when edit, manipulation, or report operations dominate
a project, the balance sways toward the efficiency found in a programming language. In projects where these
five operations are equally important, the balance rests on a combination of a programming language and SAS.
With few exceptions, projects involving extensive statistical calculations and graphing require the use of SAS
alone.
Defining the Project Deadline
When external forces determine a project deadline (for example, a corporate officer requests an analysis to
support a critical decision), sketching the detail of the quantitative analysis and data processing steps is
essential. Under this time pressure, minimizing design and documentation paperwork frees up valuable time for
quality control. When deadlines are flexible, there is more time to spend on documentation. The documentation
level cannot dominate and, therefore, roadblock the analysis. But, by the same token, this documentation
cannot be so scarce as to leave no trace of the data processing mechanics. Deadline flexibility also allows more
room to maximize computer efficiency, that is, to minimize storage, memory, and run-time requirements.
Computer Resource Needs
The programming language and statistical package mixture depends on the computer resource needs of an
application. These needs include disk storage, computer memory, and run-time. The resource needs of large
data sets[1] point the mixture toward the efficiency found in programming languages. In other applications,
the amount of edits, manipulation, and computation are the dominant factors in deciding resource
requirements and, therefore, in the programming language-statistical package mixture decision.
The Role of Staff Requirements
In some research analyses, the size and expertise of the project staff predetermine the mix. For example,
suppose one person comprises the staff. The programming background of that person determines whether SAS,
FORTRAN, PL/I, or COBOL are elements of the software mixture. Flexibility in staff size and expertise, on the
other hand, provides more opportunities to design and implement an efficient processing combination.
What is the Probability of Repeating an Analysis?
The ad hoc nature of some research applications implies that the probability of repeating an analysis is near
zero. Therefore, the development costs necessary to minimize run-time may outweigh total cost savings,
regardless of the data set size. As the probability of repetition increases, the cost savings from minimizing
run-times begin to offset development costs. In some instances, time constraints indicate the use of a
statistical package alone, even though the probability of repetition is high. After meeting the initial deadline,
more time exists for using a programming language to optimize efficiency.
Striving for the Optimal Mix
Examining the five processing attributes of quantitative analyses provides the means to strive for an optimal
programming language and statistical package equation. The emphasis here is on to strive for. Given the
deadline and staff constraints found in many research projects, it is rare to "calculate" this optimal equation.
Since precise calculation is unrealistic, researchers need a practical method to analyze their data processing
requirements. The table below outlines such a method by defining general attributes which yield three
processing mixtures. In the extreme situations (case 1 and case 3), mixture decisions are rarely unclear. For
example, the computation of monthly price indexes requires the efficiency of a programming language, whereas
the exploratory nature of model fitting requires the flexibility of a statistical package. The intermediate
situation described by case 2 includes difficult applications which stretch into case 1 or 3. For example,
selecting 1.5 million records, from 5 million records, for statistical analysis falls in the fringe, which indicates
the use of a programming language alone (case 1) or in combination with SAS (case 2).
General Guidelines for Choosing between a Programming Language and SAS, or Some Combination of the Two

Attributes                                    Case 1              Case 2              Case 3
1. Underlying Characteristics
   a. Data Editing                            Heavy               Moderate            Some
   b. Data Manipulation                       Heavy               Moderate            Some
   c. Report Generation                       Extensive           Minor               Minor
   d. Statistical Computation                 No                  Yes                 Extensive
   e. Graphing                                No                  Possibly            Extensive
2. Availability of Pre-written Programs       Not Available       Some Available      Not Available
3. Staff Mix
   a. Number                                  At least one        One or two          One
   b. Programming Language Experience         Yes (e.g., PL/I)    Minor               None
   c. SAS Experience                          No                  Yes                 Extensive
4. Project Deadline                           Open                5-10 Working Days   Less Than 5 Days
5. Computer Resource Needs
   a. Disk Storage                            Extensive           Moderate            Moderate
   b. Computer Memory                         Extensive           Moderate            Moderate
   c. Run-Time                                Over 5 Minutes      3-7 Minutes         5 Minutes At Most
6. Probability of Repeating the
   Application on an Ongoing Basis            Greater Than 50%    5-10%               0-5%

Decision
   Choose a Programming Language              X
   Choose a Programming Language and SAS                          X
   Choose SAS                                                                         X
Review the Data Processing Steps
After formulating the software mixture, it is crucial to: review programs for efficiency; estimate the required
computer resources; and verify data edits, manipulations, and formulas with a test on a small subset (for
example, 20-100 observations) of the data. A careful review of programs for efficiency double checks whether
computer resource use is at a reasonable level. Testing all program operations on a small subset of data is an
essential quality control measure. Tests also minimize costly reruns by providing valuable statistics for
estimating storage, memory, and run-time requirements for the full data set.
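A minimal SAS sketch of such a subset test (the paper shows no code; the data set and library names below are hypothetical):

   * Pull a small pilot subset (for example, 100 observations) from the full file. ;
   data work.test_claims;
      set claims.all_claims(obs=100);
   run;

   * Every edit, manipulation, and summary step is then exercised against ;
   * WORK.TEST_CLAIMS before it is pointed at the full data set. ;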
A Mixture for 4.4 Million Records
Analyzing physicians' charge patterns for twelve medical procedures becomes tricky when there are 4.4 million
records in the data set.[2] Even though these observations comprise 1.5 million records (33.6 percent of the
data), the twelve procedures appear "randomly" throughout the 4.4 million insurance claims.[3] That is,
sorting the data does not eliminate the need to read
each record.
With no assistance from utilities or existing programs, but with sufficient time and staff, a programming
language provides the best way to pull out physicians who performed any of these 12 procedures. With its data
manipulation and statistical computation capabilities, SAS provides the means to achieve flexibility in this
analysis. Therefore, a combination of a programming language and SAS formulates an optimal solution for this
analysis.
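For illustration only, the selection step amounts to a single subsetting pass over the claims; the paper implemented it in PL/I for speed, and the SAS sketch below uses placeholder data set names and procedure codes:

   * Keep only claims for the twelve procedures of interest. ;
   * CODE01 through CODE12 are placeholder procedure codes. ;
   data work.selected_claims;
      set claims.all_claims;
      if procedure_code in ('CODE01', 'CODE02', 'CODE03', 'CODE04',
                            'CODE05', 'CODE06', 'CODE07', 'CODE08',
                            'CODE09', 'CODE10', 'CODE11', 'CODE12');
   run;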
With sufficient lead time, a programmer wrote a PL/I program which processes over 29,000 records per
second.[4] At this rate, 1.1 million Medicare records and 0.4 million Blue Shield private business records came
out of the program in slightly over 2.5 minutes.
Run-time statistics show that an IBM sort utility arranged the raw (that is, not in SAS data set form) Medicare
records at a rate of 15,000 per second. In comparison, SAS's PROC SORT arranged the private business records
(that is, in SAS data set form) at a rate just exceeding 6,500 per second. This rate differential clearly indicates
the run-time savings available from a careful outline which identified the data processing steps.
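As a rough consistency check (a worked calculation, not from the paper, assuming all 4.4 million claims pass through the program at the quoted rate):

\[
\frac{4.4 \times 10^{6}\ \text{records}}{29{,}000\ \text{records per second}} \approx 152\ \text{seconds} \approx 2.5\ \text{minutes}
\]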
Medicare data had different codes for comparable private business procedures. Here, the outline revealed
another significant savings. It was more efficient to transform the Medicare codes (to match the private
business codes) after calculating means for each physician-procedure-geographic location combination (that is,
PROC MEANS; BY Physician Procedure Location). That is, it is cheaper to recode 16.5 thousand summarized
Medicare observations. The Medicare records were then merged with 13.5 thousand Blue Shield records by
physician, procedure code, and location.
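A minimal SAS sketch of that summarize-first, recode-afterwards sequence (the paper shows no code; the data set names, variable names, and code values below are hypothetical):

   * Summarize Medicare charges for each physician-procedure-location cell. ;
   proc sort data=work.medicare;
      by physician procedure location;
   run;

   proc means data=work.medicare noprint;
      by physician procedure location;
      var charge;
      output out=work.medicare_means mean=mean_charge;
   run;

   * Recode the summarized Medicare procedure codes to the private ;
   * business scheme, then merge with the Blue Shield summaries. ;
   data work.medicare_means;
      set work.medicare_means;
      if procedure = 'M001' then procedure = 'P001';   * placeholder recode ;
   run;

   proc sort data=work.medicare_means;   by physician procedure location; run;
   proc sort data=work.blueshield_means; by physician procedure location; run;

   data work.combined;
      merge work.medicare_means   (in=from_medicare)
            work.blueshield_means (in=from_blueshield);
      by physician procedure location;
   run;

Recoding only the 16,500 summarized observations, rather than the 1.1 million raw Medicare records, is what makes the order of these steps cheaper.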
Careful planning, the right combination of programming skills, and sufficient lead time bridged the
interdisciplinary gap by reducing 4.4 million claims to 30,000 physician-specific observations with minimized
run-times and maximized analysis flexibility.
Summary
A balance between efficient programs, documentation, and quick results leads to an optimal mix of a
programming language and a statistical package. Formulating the ingredients in this software mixture rests on
a careful analysis of five factors:

1. the underlying analysis characteristics;
2. the nature of the project deadline;
3. computer resource requirements;
4. staff requirements;
5. the probability of repeating the analysis.

Furthermore, a careful review of a project's purpose and this data processing mix provide the road map to avoid
errors and expensive program reruns. Ultimately, this optimal programming language-statistical package
formula provides the vehicle to bridge the gap between computer science and quantitative methods.
[1] Data set size is a relative concept which depends on computer capacity. To any computer running at peak
capacity, one million records comprise a large data set. As the load on this computer diminishes, these records
shrink toward a medium-size data set.
[2] The cost of collecting the data used in this paper was funded by Blue Cross and Blue Shield Associations and
the Health Care Financing Administration under Contract 500-78-001.
[3] For the purpose of this paper, records and claims are synonymous.
[4] Note that this processing rate applies to an AMDAHL 470/V8 configuration which processes over six million
instructions per second.