Database Integrated Analytics using R: Initial
Experiences with SQL-Server + R
Josep Ll. Berral and Nicolas Poggi
Barcelona Supercomputing Center (BSC)
Universitat Politècnica de Catalunya (BarcelonaTech)
Barcelona, Spain
Abstract—Most data scientists nowadays use functional or semi-functional languages such as SQL, Scala or R to process data obtained directly from databases. This usually requires fetching the data, processing it, and storing the results back, and it tends to happen outside the DB, in often complex data-flows. Recently, database service providers have started to integrate “R-as-a-Service” into their DB solutions: the analytics engine is called directly from the SQL query tree, and results are returned as part of the same query. Here we offer a first taste of this technology by testing the portability of our ALOJA-ML analytics framework, written in R, to Microsoft SQL Server 2016, one of the recently released SQL+R solutions. We discuss data-flow schemes for porting a local DB + analytics engine architecture towards Big Data, focusing especially on the new DB Integrated Analytics approach, and we comment on our first experiences in usability and performance with these new services and capabilities.
I. INTRODUCTION
Current data mining methodologies, techniques and algorithms are based on heavy data browsing, slicing and processing. For data scientists, who are also users of analytics, the ability to easily define the data to be retrieved and the operations to be applied to it is essential. This is why functional languages like SQL, Scala or R are so popular in these fields: although they allow high-level programming, they free the user from implementing the infrastructure for accessing and browsing data.
The usual approach when processing data is to fetch it from its source or storage (file system or relational database), bring it into a local environment (memory, distributed workers, ...), process it, and then store the results back. In this scheme, functional-language applications retrieve and slice the data, while imperative-language applications process it and manage the data-flow between systems. In most languages and frameworks, database connection protocols like ODBC or JDBC are available to streamline this data-flow, allowing applications to retrieve data directly from DBs. And although most SQL-based DB services allow user-written procedures and functions, these do not include a wide variety of primitive functions or operators.
The arrival of Big Data favored distributed frameworks like Apache Hadoop and Apache Spark, where the data is distributed “in the Cloud” and the processing can likewise be distributed to wherever the data is placed, with results then joined and aggregated. Such technologies have the advantage of distributed computing, but the scheme for accessing data
and using it remains the same; only the data distribution is transparent to the user. Still, the user is responsible for adapting any analytics to a Map-Reduce scheme, and for managing the data infrastructure.
Recently, companies like Microsoft, IBM or Cisco, providers of Analytics-as-a-Service platforms, have put special effort into complementing their solutions by adding scripting mechanisms to their DB engines, allowing analytics to be embedded in the same DB environment. All of them selected R [17] as language and analytics engine, a free and open-source statistics-oriented language long embraced by the data mining community. The current paradigm of “Fetch from DB, Process, Dump to DB” shifts towards an “In-DB Processing” scheme, where the operations to be performed on the selected data are provided by the DB's own procedure catalog. All computation remains inside the DB service, so the everyday user can proceed by simply querying the DB in SQL style. New R-based procedures, built-in or user-created by invoking R scripts and libraries, are executed as regular operations inside the query execution tree.
The idea of such integration is that not only will this processing be more usable, as it is reached by querying the DB where the data is managed and distributed, but it will also reduce the overhead of the data pipeline, as everything remains inside a single data framework. Further, for continuous data processing, analytics procedures and functions can be called directly from triggers whenever data is inserted or modified.
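As a sketch of this trigger-driven scenario (table, column and procedure names here are hypothetical illustrations, not part of the ALOJA framework), an insert trigger could re-run an R-backed prediction procedure whenever new rows arrive:

```sql
-- Hypothetical sketch: re-run an In-DB analytics procedure on new data.
-- Table and column names are illustrative only.
CREATE TRIGGER trg_new_measurements
ON dbo.measurements
AFTER INSERT
AS
BEGIN
    -- Call an R-backed prediction procedure over unprocessed rows.
    EXEC dbo.MLPredict
        @inquery = N'SELECT exe_time, maps, iofilebuf
                     FROM dbo.measurements
                     WHERE processed = 0',
        @id_hash = N'aa0279e9d32a2858ade992ab1de8f82e';

    UPDATE dbo.measurements SET processed = 1 WHERE processed = 0;
END
```

Whether calling an external-script procedure from inside a trigger is advisable depends on insert latency tolerance; queuing rows and running the analytics asynchronously is a common alternative.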
As a use case of this approach, in this paper we present some experiences porting the ALOJA framework [13] to the recently released SQL Server 2016, which incorporates this R-Service functionality. The ALOJA-ML framework is a collection of predictive analytics functions (machine learning and data mining), written in R, originally intended for modeling and predicting High Performance Computing (HPC) benchmarking workloads as part of the ALOJA Project [16], and deployed to be called after retrieving data from a MySQL database. That project collects traces and profiling data from Big Data framework technologies, and analyzing this data requires predictive analytics. These procedures are deployed in R, and connecting the framework with the database and the R engine results in a complex architecture, with a fragile and failure-prone data pipeline.
In this work we present how we could adapt our current data-processing approach to a more Big-Data oriented architecture.
2016 IEEE 16th International Conference on Data Mining Workshops
Testing and comparing performance across platforms remains future work, as does reporting experiences with IBM PureData Services [7] and other new platforms providing such services.
This article is structured as follows: Section II presents the current state of the art and recent approaches to processing data from databases. Section III explains the current and new data-flow paradigms and the required changes. Section IV shows some examples of porting code from our native R scripts to Microsoft SQL Server. Section V provides comments and details on experiments running the ALOJA framework in the new architecture. Finally, Section VI summarizes this work and presents the conclusions and future work.
II. STATE OF THE ART
Current efforts on systems processing Big Data are mostly focused on building and improving distributed systems. Platforms like Apache Hadoop [1] and Apache Spark [4], with all their “satellite” technologies, are on the rise in Big Data processing environments. Although these platforms were originally designed for Java or Python applications, the significant weight of the data mining community using R, Scala or SQL interfaces encouraged the platform strategists over the years to include interfaces and methodologies for those languages. Revolution Analytics published RHadoop [9] in 2011, a set of packages for R users to launch Hadoop tasks. These packages include HDFS [14] and HBase [2] handlers, with Map-Reduce and data processing function libraries adapted for Hadoop. This way, R scripts can dispatch parallelizable functions (e.g. R “apply” functions) to be executed on distributed worker machines.
The Apache Spark platform, developed by Berkeley's AMPLab [11] and Databricks [5] and released in 2014, focuses on four main applied data science topics: graph processing, machine learning, data streaming, and relational algebraic queries (SQL). For these, Spark is divided into four big packages: GraphX, MLlib, Spark Streaming, and SparkSQL. SparkSQL provides a library for processing data through SQL syntax or through relational algebraic functions. Recently, an R scripting interface has also been added to the initial Java, Python and Scala interfaces through the SparkR package, providing the Spark-based parallelism functions, Map-Reduce, and HBase or Hive [3] handlers. This way, R users can connect to Spark deployments and process data frames in a Map-Reduce manner, just as they could with RHadoop or with other languages.
Moving the processing towards the database side is a challenge, but it allows analytics to be integrated into the same data management environment, letting the framework that receives and stores the data also process it, however it is configured (local, distributed, ...). For this purpose, companies providing database platforms and services have put effort into adding data processing engines as integrated components of their solutions. Microsoft recently acquired Revolution Analytics and its R engine, re-branded it as R-Server [8], and connected it to the Microsoft SQL Server 2016 release [10]. IBM also released its platform Pure Data Systems for Analytics [7], providing database services that include the vanilla R engine from the Comprehensive R Archive Network [17]. Cisco, in turn, recently acquired ParStream [6], a streaming database product incorporating user-defined functions, programmable as shared object libraries in C++ or as external scripts in R.
Here we describe our first approach to integrating our R-based analytics engine into a SQL+R platform, Microsoft SQL Server 2016, primarily looking at the user experience and discussing use cases where one architecture would be preferred over the others.
III. DATA-FLOW ARCHITECTURES
Here we show three basic schemes of data processing: the local ad-hoc scheme of pull-process-push, the new distributed schemes for Hadoop and Spark, and the “In-DataBase” approach using the DB integrated analytics services. Note that there is no universal scheme that works in every situation; each serves better or worse depending on the case. As an example we take the ALOJA-ML framework: how it currently operates, its problems, and how it could be adapted to the new approaches.
A. Local Environments
Consider architectures where the data and the processing capacity are in the same location, with data-sets stored in local file systems or directly accessible DBs (local or remote databases where the user or the application can retrieve or store data). When an analytics application wants to process data, it can access the DB to fetch the required data, store it locally and then pass it to the analytics engine. Results are collected by the application and pushed back to the DB if needed. This is the classical scheme that predates distributed systems with distributed databases, and it persists in systems where the processing is not considered big enough to justify a distributed computing environment, or where the process to be applied to the data cannot be distributed.
For systems where the analytics application is just a user of the data, this scheme may be the appropriate one, as the application fetches the data it is granted to view, then does whatever it needs with it. It also suits applications whose computation does not require selecting big amounts of data: when the data required is a small fraction of the total Big Data, it is affordable to fetch that slice and process it locally. Likewise, on systems using libraries like snowfall [X], which let the user tune the parallelism of R functions, parallelism can be deployed locally up to the point where distribution requires heavier mechanisms like Hadoop or Spark.
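As a minimal sketch of this local parallelism (assuming the snowfall package is installed; the function being parallelized is purely illustrative, standing in for a heavier ALOJA-ML modeling or prediction call):

```r
library(snowfall)

# Initialize a local cluster with 4 worker processes.
sfInit(parallel = TRUE, cpus = 4)

# Illustrative per-element task; a real workload would be a
# modeling or prediction function over a slice of the data-set.
slow_task <- function(x) { Sys.sleep(0.01); x * x }

sfExport("slow_task")                  # ship the function to the workers
results <- sfLapply(1:1000, slow_task) # parallel drop-in for lapply

sfStop()                               # release the worker processes
```

The same script scales from a laptop to a multi-core server by changing `cpus`, which is precisely the point where heavier mechanisms become necessary only once a single machine no longer suffices.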
There are mechanisms and protocols like ODBC or JDBC (e.g. the RODBC library for R) that allow applications to access DBs directly and fetch data without passing through the file system, receiving it instead in the application memory space. This requires that the analytics application has been granted direct access to the DB and understands the returned format. Figure 1 shows both archetypal local schemes.
Fig. 1. Schema of local execution approaches, passing data through the File System or ODBC mechanisms
In the first case of Figure 1 we assume a scenario where the engine has no direct access to the DB, and everything that goes into analytics is triggered by the base application. Data is retrieved and pre-processed, then piped to the analytics engine through the file system (files, pipes, ...). This requires coordination between the application and the engine in order to communicate the data properly. Such scenarios can occur when the DB, the application and the engine do not belong to the same organization, or when they belong to services from different providers. They can also arise for security reasons, as the analytics may be provided by a not-so-trusted provider and the data must be anonymized before leaving the DB.
If the DB and the analytics engine are provided by the same service provider, there will be means to connect the storage with the engine through ODBC or similar, allowing the engine to fetch the data, process it, and then return the results to the DB. As an example, the Microsoft Azure-ML services [15] allow connecting their machine learning components with the Azure storage service, and also the inclusion of R code in user-defined components. In such a service, the machine learning operations do not happen on the storage service; rather, data is pulled into the engine and then processed.
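A minimal sketch of this ODBC-style pull from R (the DSN, credentials and table names here are hypothetical; it assumes the RODBC package and a configured ODBC data source):

```r
library(RODBC)

# Connect through a pre-configured ODBC data source (hypothetical DSN).
ch <- odbcConnect("aloja_dsn", uid = "user", pwd = "secret")

# Fetch only the slice of data the analysis needs,
# straight into the application memory space.
ds <- sqlQuery(ch, "SELECT exe_time, maps, iofilebuf FROM aloja6")

# ... process `ds` locally with the analytics engine ...

# Push results back to the DB and close the connection.
sqlSave(ch, ds, tablename = "aloja6_processed", rownames = FALSE)
odbcClose(ch)
```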
At this moment the ALOJA Project, consisting of a web user interface (front-end and back-end), a MySQL database and the R engine for analytics, uses this data pipeline scheme. The web interface retrieves the requested data from the ALOJA database and displays it in the front-end. If analytics are required, the back-end dumps the data to be processed into the file system and then starts the R engine. Results are collected by the back-end and displayed (and also cached). As most of the information passes through user filters, this data pipelining was preferred over connecting the scripts directly to the DB. It also kept the engine independent of the DB queries, which are in constant development in the main framework.
The next steps planned for the project are to incorporate a version of the required analytics into the DB, so that the platform's back-end can invoke any of the provided analytics, coded generically for any kind of input data, as a single SQL query.
B. Distributed Environments
An alternative architecture for the project would be to upload the data into a Distributed Database (DDB), with a view to expanding the ALOJA database towards Big Data (we still have Terabytes of data to be unpacked into the DB and analyzed at this time). The data could be stored using Hive or HBase technologies, and the analytics adapted to Hadoop or Spark. For Hadoop, the RHadoop package could handle the analytics, while the SparkSQL and SparkR packages could do the same for Spark. The analytics could then run on a distributed system with worker nodes, as most of the analytics in our platform can be parallelized. Further, Hadoop and Spark have machine learning libraries (Mahout and MLlib) that could be used natively instead of some functionalities of our ALOJA-ML framework. Figure 2 shows the execution schema of this approach.
Figure 2 shows the execution schema of this approach.
Such an approach would concentrate all the data retrieval and processing in a distributed system, not necessarily near the application but providing parallelism and agility in data browsing, slicing and processing. However, it requires a complex set-up of all the involved services and machines, and the adaptation of the analytics towards parallelism at a different level than with snowfall or similar packages. SparkR provides a Distributed Data-Frame structure with properties similar to regular R data frames, but due to the distribution some classic operations are not available
or must be performed in a different way (e.g. column binding, aggregates like “mean”, “count”, ...).
Fig. 2. Schema of distributed execution approach, with Distributed Databases or File Systems
This option would be chosen at the point where the data is too large to be handled by a single system, given that the DB managers do not provide means to distribute and retrieve the data. As in the previous approach, the data pipeline passes through the analytics engine, which produces the queries and invokes the parallelizable analytics (in a Map-Reduce manner), or collects the data locally to apply the non-parallelizable ones. The price to be paid is the set-up of the platform and the proper adjustment of the analytics to a Map-Reduce scheme.
C. Integrated Analytics
The second alternative to the local analytics scheme is to incorporate those functions into the DB services. At this point the DB can offer Analytics as a Service: users can query for analytics directly in SQL, and the data will be retrieved, processed and stored by the same framework, sparing the user from planning a full data pipeline. As the presented services and products appear to admit R scripts and programs directly, no adaptation of the R code is required; the effort goes into coding the wrapping procedures so they can be called from a SQL interface. Figure 3 shows the basic data-flow for this option.
Fig. 3. Schema of “In-DB” execution approach
The advantages of this approach are that 1) analytics, data and data management are integrated in the same software package, and the only deployment and configuration needed is that of the database; 2) the management of distributed data in Big Data deployments will be provided by the same DB software (where this is implemented as part of the service!), and the user does not need to implement Map-Reduce functions, as data can be aggregated in the same SQL query; 3) plain DB users can invoke the analytics through a SQL query, and analytics can be programmed generically to be applied to any required input SQL query result; 4) DBs providing triggers can produce Continuous Analytic Queries each time data is inserted or modified.
This option still has issues to be managed, such as the capacity for optimization and parallelism of the embedded scripts. In the case of the Microsoft R Service, any R code can be inserted inside a procedure, with no apparent optimization applied, as each script is sent directly to an external R sub-service managed by the principal DB service. This R sub-service promises multi-threading in the Enterprise editions; the classic R engine is single-threaded, and multi-threading could be applied to vectorization functions like “apply” without having to load the previously mentioned snowfall (which remains loadable, as the script runs on a compatible R engine). Also, comparing this approach to the distributed computing system: if the service supported “partitioning” of data according to given columns and values, SQL queries could be distributed in a cluster of machines (as distribution, not replication) and aggregated afterwards.
IV. ADAPTING AND EMBEDDING CODE
According to the published SQL Server documentation and API, the principal way to introduce external user-defined scripts is to wrap the R code inside a procedure. SQL Server procedures admit the script, which like any “Rscript” can source external R files and load libraries (previously copied into the corresponding R library path set up by the Microsoft R Service); they also admit parameters to be bridged into the script, and SQL queries to be executed before the script starts in order to fill its input data frames. The procedure also defines the return values, in the form of data frames as tables, single values, or tuples of values.
When the wrapping procedure is called, the input SQL queries are executed and passed as input data-frame parameters, direct parameters are also passed to the script, and the script is then executed in the R engine just as an “Rscript” or a script typed into the R command-line interface would be. The variables mapped as outputs are returned from the procedure to the invoking SQL query.
Figure 4 shows an example of calling the ALOJA-ML functions in charge of learning a linear model from the ALOJA data-set, read from a file while indicating the input and output variables, and of calling a function to predict a data-set using the previously created model. The training process creates a model and its hash ID; the prediction process then applies the model to the whole testing data-set. In the current ALOJA data pipeline, “ds” would be retrieved from the DB (to a file to be read, or directly to a variable if RODBC is used). Then
source("functions.r");
library(digest);

## Training Process
model <- aloja_linreg(ds = read.table("aloja6.csv"),
                      vin = c("maps","iofilebuf"), vout = "exe_time");
id_hash <- digest(x = model, algo = "md5"); # ID for storing the model in DB

## Prediction Example
predictions <- aloja_predict_dataset(learned_model = model,
                                     ds = read.table("aloja6test.csv"));
Fig. 4. Example of modeling and prediction using the ALOJA-ML libraries. Loading the ALOJA-ML functions allows the code to execute “aloja_linreg” to model a linear regression, and “aloja_predict_dataset” to process a new data-set using a previously trained model.
the results (“model”, “id_hash” and “predictions”) would be reintroduced into the DB using RODBC again, or by writing the results into a file and the model into a serialized R object file.
Figures 5 and 6 show how this R code is wrapped as a procedure, and examples of how these procedures are invoked. Using this scheme, in the modeling procedure “ds” is passed to the procedure as a SQL query, “vin” and “vout” are bridged parameters, and the “model” and “id_hash” are returned as two values (a serialized blob/string and a string) that can be saved into a models table. The prediction procedure, in turn, admits an “id_hash” for retrieving the model (using a SQL query inside the procedure), and returns a data-frame/table with the row IDs and the predictions.
We observed that, when preparing the embedded script, all sources, libraries and file paths must be prepared as for an Rscript executed from the command line. The environment for the script must be set up, as it will be executed in a new R session each time.
In the usual ALOJA-ML work-flow, models (R serialized objects) are stored in the file system and then uploaded to the database as binary blobs. Working directly from the DB server allows serialized objects to be encoded directly into available DB formats. Although SQL Server includes a “large binary” data type, we found problems when returning binary information from procedures (the syntax does not allow returning that data type in tuples), so serialized objects can instead be converted to text formats like base64 and stored as a variable-size character array.
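The base64 workaround can be sketched in plain R as a round trip (assuming the base64enc package, as used in our procedures; the linear model here is just a stand-in object):

```r
library(base64enc)

model <- lm(dist ~ speed, data = cars)  # any R object to be stored

# Serialize to raw bytes, then encode as a character string
# that fits a varchar/nvarchar column.
encoded <- base64encode(serialize(model, NULL))

# Later, decode the string and rebuild the original R object.
restored <- unserialize(base64decode(encoded))
```

The cost is a roughly 4/3 size inflation of the stored model, which we considered acceptable compared with fighting the binary-return limitation.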
V. EXPERIMENTS AND EXPERIENCES
A. Usability and Performance
To test the approach we used the SQL Server 2016 Basic edition, with the R Service installed, on Windows Server 2012, from a default image available in the Azure repositories. As this is the basic rather than the enterprise service, we assumed that R Services would not provide multi-threading improvements, so additional packages for such functions were installed (snowfall). The ALOJA data-set was imported through the CSV importing tools of the SQL Server Manager, the “Visual Studio”-like integrated development environment (IDE) for managing the services.
Data importing revealed some problems concerning data-type conversion: some numeric columns could not be properly imported due to precision and very large values, and had to be treated as varchar (thus not treatable as numbers). After making sure the CSV data was properly imported, the IDE allowed simple queries (SELECT, INSERT, UPDATE). At this point, tables for storing key-value entries, containing the models keyed by their ID hash, were created. Procedures wrapping the basic functions of ALOJA-ML were created, as specified in Section IV. After executing some example calls, the system worked as expected.
During the process of creating the wrapping procedures we found some issues, probably unreported bugs or internal system limitations. For instance, a procedure can return a large binary type (a large blob in MySQL and similar solutions), and it can return tuples of diverse data types, but it crashed with an internal error when trying to return a tuple of a varchar and a large binary. We worked around this by converting the serialized model object (binary type) into a base64-encoded string (varchar type), stored along with its ID hash key. As no information about this issue was found in the documentation at the time of recording these experiences, we expect such issues to be solved by the development team in the future.
We initially ran some tests using the ALOJA-ML modeling and prediction functions over the data; comparing times with a local “vanilla” R set-up, performance is almost identical, indicating that in this “basic” edition the R Server (formerly R from Revolution Analytics) is still essentially the same engine.
Another test was to run the outlier classification function of our framework. The function, explained in detail in its corresponding work [13], compares the original output variables with their predictions; if the difference is k times greater than the expected standard deviation plus modeling error, and the value does not have enough support from similar values in the rest of the data-set, the data point is considered an outlier. This implies constantly reading the data table for prediction and for comparisons between entries. The performance results were similar to those in a local execution environment, measuring only the time spent in the function. On an HDI-A2 instance (2 virtual cores, only 1 used, 3.5GB memory),
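The threshold part of this outlier rule can be paraphrased in a few lines of R (our own simplified sketch, not the actual ALOJA-ML code, which as described in [13] additionally checks support from similar entries in the data-set):

```r
# Flag entries whose prediction error exceeds k times the expected
# deviation (standard deviation of the output plus modeling error).
# Simplified paraphrase of the rule; support checking is omitted.
flag_outliers <- function(observed, predicted, model_error, k = 3) {
  threshold <- k * (sd(observed) + model_error)
  abs(observed - predicted) > threshold
}
```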
%% Creation of the Training Procedure, wrapping the R call
CREATE PROCEDURE dbo.MLPredictTrain @inquery nvarchar(max),
    @varin nvarchar(max), @varout nvarchar(max) AS
BEGIN
EXECUTE sp_execute_external_script
    @language = N'R',
    @script = N'
        source("functions.r");
        library(digest); library(base64enc);
        model <- aloja_linreg(ds = InputDataSet,
            vin = unlist(strsplit(vin, ",")), vout = vout);
        serial <- as.raw(serialize(model, NULL));
        OutputDataSet <- data.frame(model = base64encode(serial),
            id_hash = digest(serial, algo = "md5"));
    ',
    @input_data_1 = @inquery,
    @input_data_1_name = N'InputDataSet',
    @output_data_1_name = N'OutputDataSet',
    @params = N'@vin nvarchar(max), @vout nvarchar(max)',
    @vin = @varin,
    @vout = @varout
WITH RESULT SETS (("model" nvarchar(max), "id_hash" nvarchar(50)));
END

%% Example of creating a model and storing it into the DB
INSERT INTO aloja.dbo.trained_models (model, id_hash)
EXEC dbo.MLPredictTrain
    @inquery = 'SELECT exe_time, maps, iofilebuf FROM aloja.dbo.aloja6',
    @varin = 'maps,iofilebuf', @varout = 'exe_time';
Fig. 5. Version of the modeling call for ALOJA-ML functions in a SQL-Server procedure. The procedure generates the data-set for “aloja_linreg” from a parametrized query, and bridges the rest of the parameters into the script. It also indicates the format of the output, being a value, a tuple or a table (data frame).
%% Creation of the Predicting Procedure, wrapping the R call
CREATE PROCEDURE dbo.MLPredict @inquery nvarchar(max),
    @id_hash nvarchar(max) AS
BEGIN
DECLARE @modelt nvarchar(max) = (SELECT TOP 1 model
    FROM aloja.dbo.trained_models
    WHERE id_hash = @id_hash);
EXECUTE sp_execute_external_script
    @language = N'R',
    @script = N'
        source("functions.r");
        library(base64enc);
        results <- aloja_predict_dataset(learned_model =
            unserialize(as.raw(base64decode(model))),
            ds = InputDataSet);
        OutputDataSet <- data.frame(results);
    ',
    @input_data_1 = @inquery,
    @input_data_1_name = N'InputDataSet',
    @output_data_1_name = N'OutputDataSet',
    @params = N'@model nvarchar(max)',
    @model = @modelt;
END

%% Example of predicting a dataset from a SQL query with a previously trained model in DB
EXEC aloja.dbo.MLPredict
    @inquery = 'SELECT exe_time, maps, iofilebuf FROM aloja.dbo.aloja6test',
    @id_hash = 'aa0279e9d32a2858ade992ab1de8f82e';
Fig. 6. Version of the prediction call for ALOJA-ML functions in a SQL-Server procedure. Like the training procedure in Figure 5, the procedure retrieves the data to be processed from a SQL query, and passes it with the rest of the parameters into the script. Here the result is directly a table (data frame).
it took 1h:9m:56s to process the 33,147 rows, selecting just 3 features. Then, as a way to improve performance given the limitations of the single-threaded R Server, we loaded snowfall and invoked it from the ALOJA-ML "outlier dataset" function, on an HDI-A8 instance (8 virtual cores, all used; 14 GB of memory). The data-set was processed in 11m:4s, barely 1/7 of the previous time even counting the overhead of sharing data among the R processes created by snowfall. This demonstrates that, despite the set-up not being multi-threaded, R procedures can be scaled using the traditional resources available in R.
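Such a set-up can be sketched as follows. The data, scoring function and column names below are illustrative, not the actual ALOJA-ML "outlier dataset" implementation; only the snowfall calls reflect the mechanism described above:

```r
# Minimal snowfall sketch (hypothetical data and scoring function,
# not the actual ALOJA-ML code)
library(snowfall)

sfInit(parallel = TRUE, cpus = 8)        # one R worker per virtual core

score_row <- function(i, ds) {           # toy per-row outlier score
  abs(ds$exe_time[i] - mean(ds$exe_time)) / sd(ds$exe_time)
}

ds <- data.frame(exe_time = rnorm(1000, mean = 100, sd = 10))
sfExport("ds", "score_row")              # ship data and code to workers
scores <- sfSapply(seq_len(nrow(ds)), score_row, ds = ds)

sfStop()
```

The one-off cost of `sfExport` is the data-sharing overhead mentioned above; it pays off when the per-row work dominates the total time.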
B. Discussion
One of the concerns with the usage of such a service is, despite and because of the capability of multi-processing using built-in or loaded libraries, the management of the pool of R processes. R is not just a scripting language to be embedded in a procedure, but a high-level language that allows anything from creating system calls to parallelizing work among networked working nodes. Given the complexity that a user-created R function can achieve, in those cases where such a procedure is heavily requested, the R server should be able to be detached from the SQL-Server and run in dedicated HPC deployments.
In the same way snowfall can be deployed for multi-threading (and also for cluster computing), clever hacks can be created by loading RHadoop or SparkR inside a procedure, connecting the script with a distributed processing system. As SQL-Server bridges tables and query results as R data frames, such data frames can be converted to distributed data structures such as Spark's Resilient Distributed Datasets or Distributed Data Frames, uploaded to HDFS, processed, and then returned to the database. This could lead to a new architecture where SQL-Server (or equivalent solutions) connects to distributed processing environments acting as slave HPC workers for the database. A further improvement would be for the DB-server itself, instead of producing input tables/data frames, to directly return Distributed Data Frames, with the database distributed across the working nodes (by partitioning, not replication). All in all, the fact that the embedded R code is passed directly to a nearly-independent R engine allows a data scientist to do whatever can be done in a typical R session.
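A sketch of such a hack, assuming a SparkR installation (Spark 2.x API) is reachable from the procedure's R runtime; the cluster address, column names and aggregation are illustrative:

```r
# Hypothetical bridge from an in-procedure data frame to Spark
# (assumes SparkR and a reachable Spark cluster; names are illustrative)
library(SparkR)
sparkR.session(master = "spark://hpc-master:7077")

sdf <- createDataFrame(InputDataSet)      # local data frame -> distributed DF
agg <- summarize(groupBy(sdf, "maps"),    # processed on the Spark workers
                 mean_time = mean(sdf$exe_time))
OutputDataSet <- collect(agg)             # back to a local data frame for the DB

sparkR.session.stop()
```

The conversion in both directions (createDataFrame / collect) is what makes the SQL-Server bridge composable with a distributed back-end.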
VI. SUMMARY AND CONCLUSIONS
The incorporation of R, the semi-functional statistical programming language, as an embedded analytics service in databases represents an improvement in the ease and usability of analytics over any kind of data, from regular volumes to Big Data. The capability for data scientists to introduce their analytics functions as procedures in databases, avoiding complex data-flows from the DB to analytics engines, gives users and experts a quick tool for treating data in-situ and continuously.
This study discussed different architectures for data processing, involving fetching data from DBs, distributing data and processing power, and embedding the data processing into the DB, all of it using the ALOJA-ML framework as reference: a framework written in R, dedicated to modeling, predicting and classifying data from Hadoop executions stored in the ALOJA data-set. The examples and use cases shown correspond to the port of the current ALOJA architecture towards SQL-Server 2016 with integrated R Services.
After testing the analytics functions once ported into a SQL database, we observed that the major effort of the porting lies in the wrapping SQL structures that incorporate the R calls into the DB, without modifying the original R code. As performance is similar to that of standalone R distributions, the advantages come from retrieving the input data and storing the results in place.
In future work we plan to test the system more in depth, and also to compare different SQL+R solutions, as companies offering DB products have started putting effort into integrating R engines into their DB platforms. As this study focused on an initial hands-on with this new technology, future studies will focus more on comparing performance, also against other architectures for processing Big Data.
ACKNOWLEDGMENTS
This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 639595).
REFERENCES
[1] Apache Hadoop. http://hadoop.apache.org (Aug 2016).
[2] Apache HBase. https://hbase.apache.org/ (Aug 2016).
[3] Apache Hive. https://hive.apache.org/ (Aug 2016).
[4] Apache Spark. https://spark.apache.org/ (Aug 2016).
[5] Databricks inc. https://databricks.com/ (Aug 2016).
[6] ParStream. Cisco corporation. http://www.cisco.com/c/en/us/products/analytics-automation-software/parstream/index.html (Aug 2016).
[7] PureData Systems for Analytics. IBM corporation. https://www-01.ibm.com/software/data/puredata/analytics/ (Aug 2016).
[8] R-Server. Microsoft corporation. https://www.microsoft.com/en-us/cloud-platform/r-server (Aug 2016).
[9] RHadoop. Revolution Analytics. https://github.com/RevolutionAnalytics/RHadoop/wiki (Aug 2016).
[10] SQL-Server 2016. Microsoft corporation. https://www.microsoft.com/en-us/cloud-platform/sql-server (Aug 2016).
[11] UC Berkeley, AMPlab. https://amplab.cs.berkeley.edu/ (Aug 2016).
[12] Barcelona Supercomputing Center. ALOJA home page. http://aloja.bsc.es/ (Aug 2016).
[13] J. L. Berral, N. Poggi, D. Carrera, A. Call, R. Reinauer, and D. Green. ALOJA-ML: A framework for automating characterization and knowledge discovery in hadoop deployments. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015, pages 1701–1710, 2015.
[14] D. Borthakur. The Hadoop Distributed File System: Architecture and Design. http://hadoop.apache.org/docs/r0.18.0/hdfs_design.pdf. The Apache Software Foundation, 2007.
[15] Microsoft Corporation. Azure 4 Research. http://research.microsoft.com/en-us/projects/azure/default.aspx (Jan 2016).
[16] N. Poggi, J. L. Berral, D. Carrera, A. Call, F. Gagliardi, R. Reinauer, N. Vujic, D. Green, and J. A. Blakeley. From performance profiling to predictive analytics while evaluating hadoop cost-efficiency in ALOJA. In 2015 IEEE International Conference on Big Data, Big Data 2015, Santa Clara, CA, USA, October 29 - November 1, 2015, pages 1220–1229, 2015.
[17] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2014.
Managing Big Data Analytics Workflows with a Database System
Carlos Ordonez, University of Houston, USA
Javier García-García, UNAM University, Mexico
Abstract—A big data analytics workflow is long and complex, with many programs, tools and scripts interacting together. In general, in modern organizations there is a significant amount of big data analytics processing performed outside a database system, which creates many issues to manage and process big data analytics workflows. In general, data preprocessing is the most time-consuming task in a big data analytics workflow. In this work, we defend the idea of preprocessing, computing models and scoring data sets inside a database system. In addition, we discuss recommendations and experiences to improve big data analytics workflows by pushing data preprocessing (i.e. data cleaning, aggregation and column transformation) into a database system. We present a discussion of practical issues and common solutions when transforming and preparing data sets to improve big data analytics workflows. As a case study validation, based on experience from real-life big data analytics projects, we compare pros and cons between running big data analytics workflows inside and outside the database system. We highlight which tasks in a big data analytics workflow are easier to manage and faster when processed by the database system, compared to external processing.
I. INTRODUCTION
In a modern organization, transaction processing [4] and data warehousing [6] are managed by database systems. On the other hand, big data analytics projects are a different story. Despite the existing data mining functionality [5], [6] offered by most database systems, many statistical tasks are generally performed outside the database system [7], [6], [15]. This is due to the existence of sophisticated statistical tools and libraries, the lack of expertise of statistical analysts to write correct and efficient SQL queries, a somewhat limited set of statistical algorithms and techniques in the database system (compared to statistical packages) and an abundance of legacy code (generally well tuned and difficult to rewrite in a different language). In such a scenario, users exploit the database system just to extract data with ad-hoc SQL queries. Once large tables are exported, they are summarized and further transformed depending on the task at hand, but outside the database system.
Finally, when the data set has the desired variables (features), statistical models are computed, interpreted and tuned. In general, preprocessing data for analysis is the most time-consuming task [5], [16], [17] because it requires cleaning, joining, aggregating and transforming files and tables to obtain "analytic" data sets appropriate for cube or machine learning processing. Unfortunately, in such environments manipulating data sets outside the database system creates many data management issues: data sets must be recreated and re-exported every time there is a change, models need to be imported back into the database system and then deployed, different users may have inconsistent versions of the same data set on their own computers, and security is compromised. Therefore, we defend the thesis that it is better to transform data sets and compute statistical models inside the database system. We provide evidence that such an approach improves the management and processing of big data analytics workflows. We argue that database system-centric big data analytics workflows are a promising processing approach for large organizations.
The experiences reported in this work are based on analytics projects where the first author participated as a developer and consultant in a major DBMS company. Based on experience from big data analytics projects, we have become aware that statistical analysts cannot write correct and efficient SQL code to extract data from the database system or to transform their data sets for the big data analytics task at hand. To overcome such limitations, statistical analysts use big data analytics tools, statistical software and data mining packages to manipulate and transform their data sets. Consequently, users end up creating a complicated collection of programs mixing data transformation and statistical modeling tasks together. In a typical big data analytics workflow most of the development (programming) effort is spent on transforming the data set: this is the main aspect studied in this article. The data set represents the core element in a big data analytics workflow: all data transformation and modeling tasks must work on the entire data set. We present evidence that running workflows entirely inside a database system improves workflow management and accelerates workflow processing.
The article is organized as follows. Section II provides the specification of the most common big data analytics workflows. Section III discusses practical issues and solutions when preparing and transforming data sets inside a database system, contrasting processing workflows inside and outside the database system. Section IV compares advantages and time performance of processing big data analytics workflows inside the database system and outside, exploiting external
has been produced.
A typical big data analytics workflow to compute a model consists of the following major tasks (this basic workflow can vary depending on the users' knowledge and focus):
1) Data preprocessing: building data sets for analysis (selecting records, denormalization and aggregation)
2) Exploratory analysis (descriptive statistics, OLAP)
3) Computing statistical models
4) Tuning models: add or modify data set attributes, feature/variable selection, train/test models
5) Creating scoring scripts based on the selected features and the best model
This modeling workflow tends to be linear (Task i executed before Task i+1), but in general it has cycles. This is because there are many iterations building data sets, analyzing variable behavior and computing models before an acceptable model can be obtained. Initially the bottleneck of the workflow is data preprocessing. Later, as the data mining project evolves, most workflow processing is testing and tuning the desired model. In general, the database system performs some denormalization and aggregation, but it is common to perform most transformations outside the database system. When features need to be added or modified, queries need to be recomputed, going back to Task 1. Therefore, only a part of the first task of the workflow is computed inside the database system; the remaining stages of the workflow are computed with external data mining tools.
The second workflow we consider is actually deploying a model with arriving (new) data. This task is called scoring [7], [12]. Since the best features and the best model are already known, this is generally processed inside the database system. A scoring workflow is linear and non-iterative: processing starts with Task 1, Task i is executed before Task i+1, and processing ends with the last task. In a scoring workflow the tasks are as follows:
1) Data preprocessing: building the data set for scoring (select records, compute the best features).
2) Deploying the model on the test data set (e.g. predicting a class or target value), storing the model output per record as a new table or view in the database system.
3) Evaluating queries and producing summary reports on scored data sets joined with source tables. Explaining models by tracing data back to source tables in the database system.
Building the data set tends to be the most important task in both workflows, since it requires gathering data from many source tables. In this work, we defend the idea of preprocessing big data inside a database system. This approach makes sense only when the raw data originates from a data warehouse. That is, we do not tackle the problem of migrating processing of external data sets into the database system.
III. TASKS TO PREPARE DATA SETS
We discuss issues and recommendations to improve the data
processing stage in a big data analytics workflow. Our main
motivation is to bypass the use of external tools to perform
data preprocessing, using the SQL language instead.
A. Reasons for Processing Big Data Analytics Workflows Outside the Database System
In general, the main reason is of a practical nature: users do not want to translate existing code working on external tools. Commonly such code has existed for a long time (legacy programs), it is extensive (there exist many programs), and it has been debugged and tuned. Therefore, users are reluctant to rewrite it in a different language, given the associated risks. A second complaint is that, in general, the database system provides elementary statistical functionality compared to sophisticated statistical packages. Nevertheless, such gap has been shrinking over the years. As explained before, in a data mining environment most user time is spent on preparing data sets for analytic purposes. We discuss some of the most important database issues when tables are manipulated and transformed to prepare a data set for data mining or statistical analysis.
B. Main Data Preprocessing Tasks
We consider three major tasks:
1) selecting records for analysis (filtering)
2) denormalization (including math transformations)
3) aggregation
Selecting Records for Analysis: Selection of rows can be done at several stages, in different tables. Such filtering is done to discard outliers, to discard records with significant missing information (including referential integrity issues [14]), to discard records whose potential contribution to the model provides no insight, or to set aside sets of records whose characteristics deserve separate analysis. It is well known that pushing selection is the basic strategy to accelerate SPJ (select-project-join) queries, but it is not straightforward to apply across multiple queries. A common solution we have used is to perform as much filtering as possible on one data set. This makes code maintenance easier, and the query optimizer is able to exploit the filtering predicates as much as possible.
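As a sketch, the filters can be consolidated into one derived data set; the predicates and the column sales_date below are hypothetical, while sales and salesAmt reuse the names from the queries later in this section:

```sql
-- Sketch: apply all record filters once, in one place
-- (the predicates and sales_date are hypothetical)
CREATE TABLE analysis_base AS
SELECT customer_id, salesAmt, sales_date
FROM sales
WHERE salesAmt BETWEEN 0 AND 100000      -- discard outliers
  AND customer_id IS NOT NULL            -- discard incomplete records
  AND sales_date >= '2016-01-01';        -- keep the period of interest
```

Every later preparation step then reads analysis_base, so the filtering logic lives in a single statement.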
In general, for a data mining task it is necessary to select a set of records from one of the largest tables in the database based on a date range. In general, this selection requires a scan of the entire table, which is slow. When there is a secondary index based on date it is more efficient to select rows, but such an index is not always available. The basic issue is that such a transaction table is much larger than the other tables in the database. For instance, this time window defines a set of active customers or bank accounts that have recent activity. Common solutions to this problem include creating materialized views (avoiding join recomputation on large tables) and lean tables with the primary keys of the object being analyzed (record, product, etc.) to act as filters in further data preparation tasks.
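A lean filter table of this kind can be sketched as follows, reusing the sales table from the examples in this section; the date range and column sales_date are hypothetical:

```sql
-- Sketch: lean table holding only the primary keys of active customers,
-- reused as a filter in later joins (date range is hypothetical)
CREATE TABLE active_customer AS
SELECT DISTINCT customer_id
FROM sales
WHERE sales_date >= '2016-01-01';

-- later data preparation steps join against the small lean table
SELECT s.*
FROM sales s
JOIN active_customer a ON s.customer_id = a.customer_id;
```

The join against the lean table replaces a repeated scan-with-predicate on the large transaction table.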
Denormalization: In general, it is necessary to gather data from many tables and store the data elements in one place. It is well known that on-line transaction processing (OLTP) database systems update normalized tables. Normalization makes transaction processing faster, and ACID [4] semantics are easier to ensure. Queries that retrieve a few records from normalized tables are relatively fast. On the other hand, analysis on the database requires precisely the opposite: a large set of records is required, and such records gather information from many tables. Such processing typically involves complex queries with joins and aggregations. Therefore, normalization works against efficiently building data sets for analytic purposes. One solution is to keep a few key denormalized tables from which specialized tables can be built. In general, such tables cannot be dynamically maintained because they involve join computation with large tables. Therefore, they are periodically recreated as a batch process. The query shown below builds a denormalized table from which several aggregations can be computed (e.g. similar to cube queries).
SELECT
 customer_id
 ,customer_name
 ,product.product_id
 ,product_name
 ,department.department_id
 ,department_name
FROM sales
JOIN product
 ON sales.product_id = product.product_id
JOIN department
 ON product.department_id = department.department_id;
For analytic purposes it is always best to use as much data as possible. There are strong reasons for this: statistical models are more reliable, and it is easier to deal with missing information, skewed distributions, outlier discovery and so on when there is a large data set at hand. In a large database, joining tables coming from a normalized database with tables used in the past for analytic purposes may involve joins with records whose foreign keys are not found in some table. That is, natural joins may discard potentially useful records. The net effect of this issue is that the resulting data set does not include all potential objects (e.g. records, products). The solution is to define a universe data set containing all objects, gathered with a union from all tables, and then use such a table as the fundamental table on which to perform outer joins. For simplicity and elegance, left outer joins are preferred. Then left outer joins are propagated everywhere in data preparation and completeness of records is guaranteed. In general such left outer joins have a "star" form on the joining conditions, where the primary key of the master table is left joined with the primary keys of the other tables, instead of joining them with chained conditions (the FK of table T1 is joined with the PK of table T2, the FK of table T2 is joined with the PK of T3, and so on). The query below computes a global data set built from individual data sets. Notice the data sets may or may not overlap each other.
SELECT
 T_0.record_id
 ,T1.A1
 ,T2.A2
 ..
 ,Tk.Ak
FROM T_0
LEFT JOIN T1 ON T_0.record_id = T1.record_id
LEFT JOIN T2 ON T_0.record_id = T2.record_id
..
LEFT JOIN Tk ON T_0.record_id = Tk.record_id;
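The universe table T_0 itself can be sketched as a union of the keys of the individual data sets (T3 through Tk-1 elided as in the query above):

```sql
-- Sketch: T_0 gathers every record_id appearing in any data set,
-- so the left outer joins cannot drop objects
CREATE TABLE T_0 AS
SELECT record_id FROM T1
UNION            -- UNION (not UNION ALL) removes duplicate keys
SELECT record_id FROM T2
UNION
SELECT record_id FROM Tk;
```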
Aggregation: Unfortunately, most data mining tasks require dimensions (variables) that are not readily available from the database. Such dimensions typically require computing aggregations at several granularity levels. This is because most columns required by statistical or machine learning techniques are measures (or metrics), which translate into sums or counts computed with SQL. Unfortunately, granularity levels are not hierarchical (unlike cubes or OLAP [4]), making separate summary tables necessary (e.g. summarization by product or by customer in a retail database). A straightforward optimization is to compute as many dimensions as possible in the same statement, exploiting the same group-by clause when possible. In general, for a statistical analyst it is best to create as many variables (dimensions) as possible in order to isolate those that can help build a more accurate model. Summarization then tends to create tables with hundreds of columns, which makes query processing slower. However, most state-of-the-art statistical and machine learning techniques are designed to perform variable (feature) selection [7], [10], and many of those columns end up being discarded.
A typical query to derive dimensions from a transaction table is as follows:
SELECT
 customer_id
 ,count(*) AS cntItems
 ,sum(salesAmt) AS totalSales
 ,sum(case when salesAmt<0 then 1 end) AS cntReturns
FROM sales
GROUP BY customer_id;
Transaction tables generally have two or even more levels of detail, sharing some columns in their primary key. The typical example is a store transaction table with individual items alongside the total count of items and the total amount paid. This means that many times it is not possible to perform a statistical analysis from only one table; there may be unique pieces of information at each level. Therefore, such large tables need to be joined with each other and then aggregated at the appropriate granularity level, depending on the data mining task at hand. In general, such queries are optimized by indexing both tables on their common columns so that hash-joins can be used.
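As a sketch, assume a header/detail pair of transaction tables; sales_header, sales_item and their columns are hypothetical names for the two levels of detail just described:

```sql
-- Sketch: join the two levels of detail on their shared key columns,
-- then aggregate at the customer granularity
-- (sales_header and sales_item are hypothetical tables)
SELECT
 h.customer_id
 ,sum(h.total_paid) AS totalPaid      -- header-level measure
 ,count(i.item_id)  AS cntLineItems   -- detail-level measure
FROM sales_header h
JOIN sales_item i
 ON h.store_id = i.store_id
 AND h.transaction_id = i.transaction_id
GROUP BY h.customer_id;
```

Indexing both tables on (store_id, transaction_id) lets the optimizer use a hash-join, as noted above.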
Combining Denormalization and Aggregation, Multiple Primary Keys: A last aspect is aggregation and denormalization interacting together when there are multiple primary keys. In a large database, different subsets of tables have different primary keys. In other words, such tables are not compatible with each other for further aggregation. The key issue is that at some point large tables with different primary keys must be joined and summarized. Join operations will be slow because indexing involves foreign keys with large cardinalities. Two solutions are common: creating a secondary index on the alternative primary key of the largest table, or creating a denormalized table having both primary keys in order to enable fast join processing.
For instance, consider a data mining project in a bank that requires analysis by customer id, but also by account id. One customer may have multiple accounts. An account may have multiple account holders. Joining and manipulating such tables is challenging given their sizes.
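Both solutions can be sketched for the bank example; the table and index names (account_tx, account_holder, customer_account) are hypothetical:

```sql
-- Sketch 1: secondary index on the alternative primary key
-- of the largest table (account_tx is hypothetical)
CREATE INDEX ix_tx_customer ON account_tx (customer_id);

-- Sketch 2: denormalized bridge table carrying both primary keys,
-- enabling fast joins at either granularity
CREATE TABLE customer_account AS
SELECT DISTINCT customer_id, account_id
FROM account_holder;
```

The bridge table lets summaries by customer and summaries by account be joined through small-key lookups instead of scans of the large transaction table.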
C. Workflow Processing
Dependent SQL statements: A data transformation script is a long sequence of SELECT statements. Their dependencies are complex, although there exists a partial order defined by the order in which temporary tables and data sets for analysis are created. For debugging SQL code it is a bad idea to create a single query with multiple query blocks. On the other hand, separate SQL statements are not amenable to joint optimization by the query optimizer, unless it can keep track of historic usage patterns of queries. A common solution is to create intermediate tables that can be shared by several statements. Those intermediate tables commonly have columns that can later be aggregated at the appropriate granularity levels.
Computer resource usage: This aspect includes both disk and CPU usage, the second being the more valuable resource. The problem is compounded by the fact that most data mining tasks work on the entire data set or large subsets of it. In an active database environment, running data preparation tasks during peak usage hours can degrade performance since, generally speaking, large tables are read and large tables are created. Therefore, it is necessary to use workload management tools to optimize queries from several users together. In general, the solution is to give data preparation tasks a lower priority than queries from interactive users. As a longer-term strategy, it is best to organize data mining projects around common data sets, but such a goal is difficult to reach given the mathematical nature of the analysis and the ever-changing nature of the variables (dimensions) in the data sets.
Comparing views and temporary tables: Views provide
limited control on storage and indexing. It may be better
to create temporary tables, especially when there are many
primary keys used in summarization. Nevertheless, disk space
usage grows fast and such tables/views need to be refreshed
when new records are inserted or new variables (dimensions)
are created.
Scoring: Even though many models are built outside the database system with statistical packages and data mining tools, in the end the model must be applied inside the database system [6]. When data volumes are not large it is feasible to perform model deployment outside: exporting data sets, applying the model and building reports can be done in no more than a few minutes. However, as the data volume increases, exporting data from the database system becomes a bottleneck. This problem is compounded at results-interpretation time, when it is necessary to relate statistical numbers back to the original tables in the database. Therefore, it is common to build models outside, frequently based on samples, and once an acceptable model is obtained, to import it back into the database system. Nowadays, model deployment basically happens in two ways: using SQL queries if the mathematical computations are relatively simple, or with UDFs [2], [12] if the computations are more sophisticated. In most cases, such a scoring process can work in a single table scan, providing good performance.
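For instance, deploying a logistic regression model reduces to one SQL expression per record, evaluated in a single scan; the coefficients and the table customer_summary below are hypothetical (the derived columns reuse the names from the aggregation query earlier):

```sql
-- Sketch: scoring a logistic regression model in one table scan
-- (coefficients and customer_summary are hypothetical)
SELECT
 customer_id
 ,1.0 / (1.0 + exp(-(-2.5 + 0.8*totalSales + 1.2*cntReturns)))
   AS churn_score
FROM customer_summary;
```

Storing the result as a table or view keeps the scored output joinable with the source tables for the summary reports mentioned above.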
IV. COMPARING PROCESSING OF WORKFLOWS INSIDE
AND OUTSIDE A DATABASE SYSTEM
In this section we present a qualitative and quantitative
comparison between running big data analytics workflows
inside and outside the database system. This discussion is
a summary of representative successful projects. We first
discuss a typical database environment. Second, we present a
summary of the data mining projects presenting their practical
application and the statistical and data mining techniques
used. Third, we discuss advantages and accomplishments for
each workflow running inside the database system, as well
as the main objections or concerns against such approach
(i.e. migrating external transformation code into the database
system). Fourth, we present time measurements taken from
actual projects at each organization, running big data analytics
workflows completely inside and partially outside the database
system.
A. Data Warehousing Environment
The environment was a data warehouse, where several databases were already integrated into a large enterprise-wide database. The database server was surrounded by specialized servers performing OLAP and statistical analysis. One of those servers was a statistical server with a fast network connection to the database server.
First of all, an entire set of statistical language programs was translated into SQL using the Teradata data mining program, the translator tool and customized SQL code. Second, in every case the data sets were verified to have the same contents in the statistical language and in SQL. In most cases, the numeric output from statistical and machine learning models was the same, but sometimes there were slight numeric differences, given variations in algorithmic improvements and advanced parameters (e.g. epsilon for convergence, step-wise regression procedures, pruning method in decision trees and so on).
B. Organizations: Statistical Models and Business Application
We now give a brief discussion of the organizations where the statistical code migration took place. We also discuss the specific type of data mining techniques used in each case. Due to privacy concerns we omit specific information about each organization, their databases and the hardware configuration of their database system servers. The big data analytics workflows were executed on the organizations' database servers, concurrently with other users (analysts, managers, DBAs, and so on).
We now describe the computers processing the workflow in more detail. The database system server was, in general, a parallel multiprocessor computer with a large number of CPUs, ample memory per CPU and several terabytes of parallel disk storage in high-performance RAID configurations. On the other hand, the statistical server was generally a smaller computer with less than 500 GB of disk space but ample memory. Statistical and data mining analysis inside the database system was performed only with SQL. In general, a workstation connected to each server with the appropriate client utilities. The connection to the database system was made with ODBC. All time measurements discussed herein were taken on 32-bit CPUs over the course of several years. Therefore, they cannot be compared with each other and they should only be used to understand performance gains within the same organization.
The first organization was an insurance company. The
data mining goal involved segmenting customers into tiers
according to their profitability based on demographic data,
billing information and claims. The statistical techniques used
to determine segments involved histograms and clustering.
The final data set had about n = 300k records and d = 25
variables. There were four segments, categorizing customers
from best to worst.
The second organization was a cellular telephone service
provider. The data mining task involved predicting which
customers were likely to upgrade their call service package
or purchase a new handset. The default technique was logistic
regression [7] with stepwise procedure for variable selection.
The data set used for scoring had about n = 10M records and
d = 120 variables. The predicted variable was binary.
The third organization was an Internet Service Provider
(ISP). The predictive task was to detect which customers were
likely to disconnect service within a time window of a few
months, based on their demographic data, billing information
and service usage. The statistical techniques used in this case
were decision trees and logistic regression and the predicted
variable was binary. The final data set had n = 3.5M records
and d = 50 variables.
C. Database system-centric big data analytics workflows:
Users Opinion
We summarize pros and cons about running big data analyt-
ics workflows inside and outside the database system. Table I
contains a summary of outcomes. As we can see performance
to score data sets and transforming data sets are positive
61. outcomes in every case. Building the models faster turned
out not be as important because users relied on sampling to
build models and several samples were collected to tune and
test models. Since all databases and servers were within the
same firewall security was not a major concern. In general,
improving data management was not seen as major concern
because there existed a data warehouse, but users acknowledge
a “distributed” analytic environment could be a potential
management issue. We now summarize the main objections,
despite the advantages discussed above. We exclude cost as a
decisive factor to preserve anonymity of users opinion and
give an unbiased discussion. First, many users preferred a
traditional programming language like Java or C++ instead
of a set-oriented language like SQL. Second, some specialized
techniques are not available in the database system due to their
mathematical complexity; relevant examples include Support
Vector Machines, Non-linear regression and time series mod-
els. Finally, sampling is a standard mechanism to analyze large
data sets.
D. Workflow Processing Time Comparison
We compare workflow processing time inside the database
system using SQL and UDFs and outside the database system,
using an external server transforming exported data sets and
computing models. In general, the external server ran existing
data transformation programs developed by each organization.
On the other hand, each organization had diverse data mining
tools that analyzed flat files. We must mention the comparison
is not fair because the database system server was in general
a powerful parallel computer and the external server was a
smaller computer. Nevertheless, such a setting represents a
common IT environment where the fastest computer is precisely
the database system server.
We discuss tables from the database in more detail. There
were several input tables coming from a large normalized
database that were transformed and denormalized to build
data sets used by statistical or machine learning techniques.
In short, the inputs were tables and the outputs were tables as
well. No data sets were exported in this case: all processing
happened inside the database system. On the other hand,
analysis on the external server relied on SQL queries to extract
data from the database system, transform the data to produce
data sets on the statistical server, and then build models or
score data sets based on a model. In general, data extraction
from the database system was performed using bulk utilities
which exported data records in blocks. Clearly, there was an
export bottleneck from the database system to the external
server.
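The block-wise export described above can be sketched in a few lines. This is a minimal illustration using SQLite from Python; the table name, schema and block size are hypothetical, but the pattern mirrors what a bulk utility does: fetch records in blocks rather than one row per round trip.

```python
import sqlite3, csv, io

# Build a small in-memory database to export from (illustrative schema).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (id INTEGER, age REAL, revenue REAL)")
cur.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                [(i, 20 + i % 50, 100.0 * i) for i in range(10000)])
conn.commit()

def bulk_export(cursor, table, block_size=1000):
    """Export records block by block, mimicking a bulk export utility."""
    out = io.StringIO()
    writer = csv.writer(out)
    cursor.execute(f"SELECT * FROM {table}")
    blocks = 0
    while True:
        block = cursor.fetchmany(block_size)  # one fetch per block of rows
        if not block:
            break
        writer.writerows(block)
        blocks += 1
    return out.getvalue(), blocks

data, blocks = bulk_export(cur, "customers")
print(blocks)  # 10 blocks of 1000 records each
```

Even with blocking, every exported record still crosses the database boundary, which is why the export step dominates the external workflow's processing time.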
Table II compares performance between both workflow pro-
cessing alternatives: inside and outside the database system. As
TABLE I
ADVANTAGES AND DISADVANTAGES OF RUNNING BIG DATA ANALYTICS
WORKFLOWS INSIDE A DATABASE SYSTEM.

                                            Insur. Phone  ISP
Advantages:
  Improve workflow management                  X     X     X
  Accelerate workflow execution                X
  Prepare data sets more easily                X     X     X
  Compute models faster without sampling       X
  Score data sets faster                       X     X     X
  Enforce database security                    X     X
Disadvantages:
  Workflow output independent                  X     X
  Workflow outside database system OK          X     X
  Prefer programming language over SQL         X     X
  Samples accelerate modeling                  X     X
  Database system lacks statistical models     X
  Legacy code                                  X
TABLE II
WORKFLOW PROCESSING TIME INSIDE AND OUTSIDE A DATABASE
SYSTEM (TIME IN MINUTES).

Task                   Outside  Inside
Build model:
  Segmentation               2       1
  Predict propensity        38       8
  Predict churn            120      20
Score data set:
  Segmentation               5       1
  Predict propensity       150       2
  Predict churn             10       1
TABLE III
TIME TO COMPUTE MODELS INSIDE THE DATABASE SYSTEM AND
TIME TO EXPORT THE DATA SET (SECS).

n × 1000    d   SQL/UDF   BULK    ODBC
     100    8         4     17     168
     100   16         5     32     311
     100   32         6     63     615
     100   64         8    121    1204
    1000    8        40    164    1690
    1000   16        51    319    3112
    1000   32        62    622    6160
    1000   64        78   1188   12010
introduced in Section II, we distinguish two big data analytics
workflows: computing the model and scoring the model on
large data sets. The times shown in Table II include the time
to transform the data set with joins and aggregations, the time
to compute the model, and the time to score by applying the best
model. As we can see, the database system is significantly
faster. We must mention that to build the predictive models
both approaches exploited samples from a large data set. Then
the models were tuned with further samples. To score data
sets, the gap is wider, highlighting the efficiency of SQL in
computing joins and aggregations to build the data set and then
evaluating mathematical equations on the data set. In general,
the main reason the external server was slower was the time
to export data from the database system; a secondary reason
was its more limited computing power.
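Scoring inside the database system avoids the export entirely: once a model's coefficients are known, scoring reduces to evaluating an arithmetic expression in a SELECT over the data set. A minimal sketch, with an illustrative table and assumed linear-model coefficients (not taken from the paper's case studies):

```python
import sqlite3

# Illustrative data set: id plus two predictor variables.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE ds (id INTEGER, x1 REAL, x2 REAL)")
cur.executemany("INSERT INTO ds VALUES (?, ?, ?)",
                [(i, float(i), float(2 * i)) for i in range(5)])

# Coefficients of an already-fitted linear model (assumed values).
b0, b1, b2 = 1.0, 0.5, 0.25

# The model is deployed as a plain SQL expression: one scan, no export.
cur.execute(
    "SELECT id, ? + ? * x1 + ? * x2 AS score FROM ds ORDER BY id",
    (b0, b1, b2))
scores = cur.fetchall()
print(scores[1])  # (1, 2.0): 1.0 + 0.5*1 + 0.25*2
```

The same idea extends to logistic regression (wrapping the linear term in a sigmoid UDF) and to decision trees (expressed as nested CASE expressions).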
Table III compares processing time to compute a model
inside the database system and the time to export the data
set (with a bulk utility and ODBC). In this case the database
system ran on a relatively small computer with a 3.2 GHz CPU,
4 GB of memory and 1 TB of disk. The models include PCA,
Naive Bayes and linear regression, which can be derived from
the correlation matrix [7] of the data set in a single table
scan using SQL queries and UDFs. Exporting the data set is
a bottleneck to perform data mining processing outside the
database system, regardless of how fast the external server is.
However, exporting a sample from the data set can be done
quickly, whereas analyzing a large data set without sampling is
much faster inside the database system. Moreover, sampling can
also be exploited inside the database system.
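The single-table-scan claim above can be made concrete. The correlation matrix is derived from sufficient statistics n (count), L (linear sums) and Q (sums of squares and cross-products), all of which SQL aggregates gather in one scan. A sketch for two variables, with an illustrative table; the real data sets had tens of variables, but the pattern is identical:

```python
import sqlite3, math

# Illustrative data set where y is an exact linear function of x.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE ds (x REAL, y REAL)")
cur.executemany("INSERT INTO ds VALUES (?, ?)",
                [(float(i), 2.0 * i + 1.0) for i in range(10)])

# One table scan collects n, L and Q via SQL aggregates.
n, lx, ly, qxx, qyy, qxy = cur.execute(
    "SELECT COUNT(*), SUM(x), SUM(y), SUM(x*x), SUM(y*y), SUM(x*y) FROM ds"
).fetchone()

# The correlation follows from the sufficient statistics alone;
# no further passes over the data are needed.
cov = qxy / n - (lx / n) * (ly / n)
sdx = math.sqrt(qxx / n - (lx / n) ** 2)
sdy = math.sqrt(qyy / n - (ly / n) ** 2)
rho = cov / (sdx * sdy)
print(round(rho, 6))  # 1.0, since y = 2x + 1 exactly
```

PCA, Naive Bayes and linear regression can all be computed from the same n, L, Q summaries [7], [12], which is why one scan plus a small in-memory solve suffices inside the database system.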
V. RELATED WORK
There exist many proposals that extend SQL with data
mining functionality. Most proposals add syntax to SQL
and optimize queries using the proposed extensions. Several
techniques to execute aggregate UDFs in parallel are studied
in [9]; these ideas are currently used by modern parallel
relational database systems such as Teradata. SQL extensions
to define, query and deploy data mining models are proposed
in [11]; such extensions provide a friendly language interface
to manage data mining models. This proposal focuses on man-
aging models rather than computing them and therefore such
extensions are complementary to UDFs. Query optimization
techniques and a simple SQL syntax extension to compute
multidimensional histograms are proposed in [8], where a
multiple grouping clause is optimized.
Some related work on exploiting SQL for data manipulation
tasks includes the following. Data mining primitive operators
are proposed in [1], including an operator to pivot a table
and another one for sampling, useful to build data sets. The
pivot/unpivot operators are extremely useful to transpose and
transform data sets for data mining and OLAP tasks [3],
but they have not been standardized. Horizontal aggregations
were proposed to create tabular data sets [13], as required
by statistical and machine learning techniques, combining
pivoting and aggregation in one function. For the most part,
research work on preparing data sets for analytic purposes in
a relational database system remains scarce. Data quality is a
fundamental aspect in a data mining application. Referential
integrity quality metrics are proposed in [14], where users can
isolate tables and columns with invalid foreign keys. Such
referential problems must be solved in order to build a data
set without missing information.
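A horizontal aggregation in the spirit of [13] can be emulated in standard SQL with CASE expressions inside aggregates, pivoting a normalized fact table into one row per entity and one column per category. A minimal sketch with hypothetical table and column names:

```python
import sqlite3

# Illustrative normalized fact table: one row per (customer, category) event.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (cust INTEGER, category TEXT, amount REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    (1, "phone", 10.0), (1, "data", 5.0),
    (2, "phone", 7.0),  (2, "phone", 3.0),
])

# Pivot: each category becomes a column, combining pivoting and
# aggregation in one GROUP BY query.
rows = cur.execute("""
    SELECT cust,
           SUM(CASE WHEN category = 'phone' THEN amount ELSE 0.0 END) AS phone,
           SUM(CASE WHEN category = 'data'  THEN amount ELSE 0.0 END) AS data
    FROM sales
    GROUP BY cust
    ORDER BY cust
""").fetchall()
print(rows)  # [(1, 10.0, 5.0), (2, 10.0, 0.0)]
```

The result is the tabular layout (one row per observation, one column per variable) that statistical and machine learning techniques require.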
Mining workflow logs, a different problem, has received at-
tention [18]; the basic idea is to discover patterns in workflow
process logs. To the best of our knowledge, there is scarce
research work dealing with migrating data preprocessing into
a database system to improve management and processing of
big data analytics workflows.
VI. CONCLUSIONS
We presented practical issues and discussed common solu-
tions to push data preprocessing into a database system to
improve management and processing of big data analytics
workflows. It is important to emphasize that data preprocessing
is generally the most time-consuming and error-prone task
in a big data analytics workflow. We identified specific data
preprocessing issues. Summarization generally has to be done
at different granularity levels and such levels are generally not
hierarchical. Rows are selected based on a time window, which
requires indexes on date columns. Row selection (filtering)
with complex predicates happens on many tables, making code
maintenance and query optimization difficult. In general, it is
necessary to create a “universe” data set to define left outer
joins. Model deployment requires importing models as SQL
queries or UDFs to deploy a model on large data sets. Based
on experience from real-life projects, we compared advantages
and disadvantages when running big data analytics workflows
entirely inside and outside a database system. Our general
observations are the following. Big data analytics workflows
are easier to manage inside a database system. A big data
analytics workflow is generally faster to run inside a database
system assuming a data warehousing environment (i.e. data
originates from the database). From a practical perspective
workflows are easier to manage inside a database system
because users can exploit the extensive capabilities of the
database system (querying, recovery, security and concurrency
control) and there is less data redundancy. However, external
statistical tools may provide more flexibility than SQL and
more statistical techniques. From an efficiency (processing
time) perspective, transforming and scoring data sets and
computing models are faster to develop and run inside the
database system. Nevertheless, sampling represents a practical
solution to accelerate big data analytics workflows running
outside a database system.
Improving and optimizing the management and processing
of big data analytics workflows provides many opportunities
for future work. It is necessary to specify models which take
into account processing with external data mining tools. The
data set tends to be the bottleneck in a big data analytics
workflow from both programming and processing-time
perspectives. Therefore, it is necessary to specify workflow
models which can serve as templates to prepare data sets. The
role of the statistical model in a workflow needs to be studied in
the context of the source tables that were used to build the
data set; very few organizations reach that stage.
REFERENCES
[1] J. Clear, D. Dunn, B. Harvey, M.L. Heytens, and P. Lohman. Non-stop SQL/MX primitives for knowledge discovery. In ACM KDD Conference, pages 425–429, 1999.
[2] J. Cohen, B. Dolan, M. Dunlap, J. Hellerstein, and C. Welton. MAD skills: New analysis practices for big data. In Proc. VLDB Conference, pages 1481–1492, 2009.
[3] C. Cunningham, G. Graefe, and C.A. Galindo-Legaria. PIVOT and UNPIVOT: Optimization and execution strategies in an RDBMS. In Proc. VLDB Conference, pages 998–1009, 2004.
[4] R. Elmasri and S.B. Navathe. Fundamentals of Database Systems. Addison-Wesley, 4th edition, 2003.
[5] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11):27–34, November 1996.
[6] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, 2nd edition, 2006.
[7] T. Hastie, R. Tibshirani, and J.H. Friedman. The Elements of Statistical Learning. Springer, New York, 1st edition, 2001.
[8] A. Hinneburg, D. Habich, and W. Lehner. Combi-operator: Database support for data mining applications. In Proc. VLDB Conference, pages 429–439, 2003.
[9] M. Jaedicke and B. Mitschang. On parallel processing of aggregate and scalar functions in object-relational DBMS. In ACM SIGMOD Conference, pages 379–389, 1998.
[10] T.M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.
[11] A. Netz, S. Chaudhuri, U. Fayyad, and J. Bernhardt. Integrating data mining with SQL databases: OLE DB for data mining. In Proc. IEEE ICDE Conference, pages 379–387, 2001.
[12] C. Ordonez. Statistical model computation with UDFs. IEEE Transactions on Knowledge and Data Engineering (TKDE), 22(12):1752–1765, 2010.
[13] C. Ordonez and Z. Chen. Horizontal aggregations in SQL to prepare data sets for data mining analysis. IEEE Transactions on Knowledge and Data Engineering (TKDE), 24(4):678–691, 2012.
[14] C. Ordonez and J. García-García. Referential integrity quality metrics. Decision Support Systems Journal, 44(2):495–508, 2008.
[15] C. Ordonez and J. García-García. Database systems research on data mining. In Proc. ACM SIGMOD Conference, pages 1253–1254, 2010.
[16] D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1999.
[17] E. Rahm and H.H. Do. Data cleaning: Problems and current approaches. IEEE Bulletin of the Technical Committee on Data Engineering, 23(4), 2000.
[18] W.M.P. van der Aalst, B.F. van Dongen, J. Herbst, L. Maruster, G. Schimm, and A.J.M.M. Weijters. Workflow mining: A survey of issues and approaches. Data & Knowledge Engineering, 47(2):237–267, 2003.