Database Integrated Analytics using R: Initial
Experiences with SQL-Server + R
Josep Ll. Berral and Nicolas Poggi
Barcelona Supercomputing Center (BSC)
Universitat Politècnica de Catalunya (BarcelonaTech)
Barcelona, Spain
Abstract—Most data scientists nowadays use functional or
semi-functional languages like SQL, Scala or R to process data
obtained directly from databases. This usually requires fetching
the data, processing it, and storing the results back, and such
work tends to be done outside the DB, in often complex data-flows.
Recently, database service providers have started to integrate
“R-as-a-Service” into their DB solutions: the analytics engine is
called directly from the SQL query tree, and results are returned
as part of the same query. Here we give a first taste of this
technology by testing the portability of our ALOJA-ML analytics
framework, coded in R, to Microsoft SQL-Server 2016, one of
the recently released SQL+R solutions. In this work we discuss
several data-flow schemes for porting a local DB + analytics engine
architecture towards Big Data, focusing especially on the new
DB-integrated analytics approach, and we report first experiences
with the usability and performance of these new services and
capabilities.
I. INTRODUCTION
Current data mining methodologies, techniques and algorithms
are based on heavy data browsing, slicing and processing.
For data scientists, and for users of analytics in general, being
able to easily define the data to be retrieved and the operations
to be applied to it is essential. This is why functional languages
like SQL, Scala or R are so popular in these fields: while they
allow high-level programming, they also free the user from
programming the infrastructure for accessing and browsing the data.
The usual trend when processing data is to fetch it from
the source or storage (file system or relational database),
bring it into a local environment (memory, distributed workers,
...), treat it, and then store the results back. In this schema,
functional language applications are used to retrieve and slice
the data, while imperative language applications process the
data and manage the data-flow between systems. In most
languages and frameworks, database connection protocols like
ODBC or JDBC are available to ease this data-flow, allowing
applications to retrieve data directly from DBs. And although
most SQL-based DB services allow user-written procedures
and functions, these do not include a wide variety of primitive
functions or operators.
The arrival of Big Data favored distributed frameworks
like Apache Hadoop and Apache Spark, where the data is
distributed “in the Cloud” and the processing can also be
distributed to where the data is placed, with results then joined
and aggregated. Such technologies have the advantage of
distributed computing, but the schema for accessing and using
the data remains the same; only the data distribution is
transparent to the user. The user is still responsible for adapting
any analytics to a Map-Reduce schema, and for managing
the data infrastructure.
Recently, companies like Microsoft, IBM or Cisco,
providers of Analytics-as-a-Service platforms, have put special
effort into complementing their solutions by adding scripting
mechanisms to their DB engines, making it possible to embed
analytics in the same DB environment. All of them selected
R [17] as language and analytics engine, a free and open-source
statistics-oriented language and engine that the data mining
community embraced long ago.
The current paradigm of “Fetch from DB, Process, Dump to
DB” is shifted towards an “In-DB Processing” schema, so the
operations to be done on the selected data are provided by
the same DB procedures catalog. All the computation remains
inside the DB service, so the daily user can proceed by simply
querying the DB in a SQL style. New R-based procedures,
built-in or user created by invoking R scripts and libraries, are
executed as regular operations inside the query execution tree.
The idea of such integration is that not only does this processing
become more usable, as it is invoked by querying the DB, where
the data is managed and distributed, but it also reduces the
overhead of the data pipe-line, as everything remains inside a
single data framework. Further, for continuous data processing,
analytics procedures and functions can be called directly from
triggers when data is continuously introduced or modified.
As a use case of this approach, in this paper we present
some experiences on porting the ALOJA framework [13] to
the recently released SQL-Server 2016, which incorporates this
R-Service functionality. The ALOJA-ML framework is a collection
of predictive analytics functions (machine learning
and data mining), written in R, originally intended for modeling
and predicting High Performance Computing (HPC)
benchmarking workloads as part of the ALOJA project [16],
deployed to be called after retrieving data from a MySQL
database. The project collects traces and profiling data from Big
Data framework technologies, and analyzing this data requires
predictive analytics. These procedures are deployed in R, and
connecting the framework with the database and the R engine
results in a complex architecture, with a fragile data pipe-line
that is susceptible to failures.
In this work we present how we adapted our current
data-processing approach to a more Big-Data-oriented architecture,
and how we tested it using the ALOJA data-set [12], as
an example and as a review from the user’s point of view.
After testing the Microsoft version of the SQL+R services [10],
we saw that the major complication, beyond deploying
the service itself, is building the SQL wrapping procedures for the
R scripts to be executed. When reusing or porting existing
code or R applications, the wrapper just has to source
(the R function for “import”) the original code and its libraries,
and execute the corresponding functions (plus bridging the
parameters and the function’s return value). Current services
are prepared to transform two-dimensional R data frames into
tables as the result of such procedures. Aside from the Microsoft
services, we took a look at Cisco ParStream [6], which displays
a very similar approach, differing in the way the R scripts are
instantiated: through R file calls instead of direct scripting.
It remains as future work to test and compare performance
across platforms, and to include some experiences
with IBM PureData Services [7] and any other new platform
providing such services.
This article is structured as follows: Section II presents
the current state-of-art and recent approaches used when
processing data from databases. Section III explains the current
and new data-flow paradigm, and required changes. Section IV
shows some code porting examples from our native R scripts to
Microsoft SQL Server. Section V provides comments and de-
tails on some experiments on running the ALOJA framework
in the new architecture. Finally, Section VI summarizes this
work and presents the conclusions and future work.
II. STATE OF THE ART
Current efforts on systems for processing Big Data are mostly
focused on building and improving distributed systems. Platforms
like Apache Hadoop [1] and Apache Spark [4], with
all their “satellite” technologies, are on the rise in Big Data
processing environments. Although those platforms were
originally designed for Java or Python applications, the
significant weight of the data mining community using R,
Scala or SQL interfaces encouraged the platform strategists
over the years to include interfaces and methodologies for
such languages. Revolution Analytics published RHadoop [9]
in 2011, a set of packages for R users to launch Hadoop
tasks. The packages include HDFS [14] and HBase [2] handlers,
with Map-Reduce and data processing function libraries
adapted for Hadoop. This way, R scripts can dispatch
parallelizable functions (e.g. R “apply” functions) to be executed
on distributed worker machines.
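As an illustration of this style of dispatching, the sketch below shows a small aggregation written with the rmr2 package from the RHadoop collection; the data and column names are hypothetical, and the exact API may differ between RHadoop releases.

library(rmr2)   # Map-Reduce handlers from the RHadoop packages
# Hypothetical example: sum a value per key as a Hadoop job.
input <- to.dfs(data.frame(key = c("a", "b", "a"), value = c(1, 2, 3)))
result <- mapreduce(
  input  = input,
  map    = function(k, v) keyval(v$key, v$value),
  reduce = function(k, vv) keyval(k, sum(vv)))
from.dfs(result)   # collect the aggregated key-value pairs back into the R session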
The Apache Spark platform, developed by Berkeley’s
AMPlab [11] and Databricks [5] and released in 2014, focuses
on four main applied data science topics: graph processing,
machine learning, data streaming, and relational algebraic
queries (SQL). For these, Spark is divided into four big packages:
GraphX, MLlib, Spark Streaming, and SparkSQL. SparkSQL
provides a library for treating data through SQL syntax or
through relational algebraic functions. Recently, an R
scripting interface has also been added to the initial Java, Python
and Scala interfaces through the SparkR package, providing
the Spark-based parallelism functions, Map-Reduce, and HBase
or Hive [3] handlers. This way, R users can connect to
Spark deployments and process data frames in a Map-Reduce
manner, the same way they could with RHadoop or with
other languages.
Moving the processing towards the database side is a
challenge, but it allows analytics to be integrated
into the same data management environment, letting the same
framework that receives and stores the data also process it, in
whatever way it is configured (local, distributed...). For this purpose,
companies providing database platforms and services have put effort
into adding data processing engines as integrated components of
their solutions. Microsoft recently acquired Revolution Analytics
and its R engine, re-branded it as R-Server [8], and connected it
to the Microsoft SQL-Server 2016 release [10]. IBM also
released its platform PureData Systems for Analytics [7],
providing database services that include the vanilla R engine
from the Comprehensive R Archive Network [17]. Cisco
recently acquired ParStream [6], a streaming database product
incorporating user-defined functions, programmable as shared
object libraries in C++ or as external scripts in R.
Here we describe our first approach to integrating our R-based
analytics engine into a SQL+R platform, Microsoft SQL-Server
2016, primarily looking at the user experience and
discussing use cases where one architecture would be
preferred over the others.
III. DATA-FLOW ARCHITECTURES
Here we show three basic schemes of data processing:
the local ad-hoc schema of pull-process-push, the distributed
schemes for Hadoop and Spark, and the “In-DataBase” approach
using the DB integrated analytics services. Note that there is no
universal schema that works in every situation; each one serves
better or worse depending on the case. As an example of these
architectures we use the ALOJA-ML framework: how it currently
operates, its problems, and how it would be adapted to the
new approaches.
A. Local Environments
In these architectures the data and the processing capacity are in
the same location, with data-sets stored in local file systems
or in direct-access DBs (local or remote databases that the
user or the application can access to retrieve or store data).
When an analytics application wants to process data, it
accesses the DB to fetch the required data, stores it locally and
then passes it to the analytics engine. Results are collected by
the application and pushed back to the DB if needed. This
is the classical schema from before distributed systems and
distributed databases, and it is still used for systems where the
processing is not considered big enough to justify a distributed
computing environment, or where the process to be applied to
the data cannot be distributed.
For systems where the analytics application is just a user
of the data, this schema may be the right one: the application
fetches the data it is granted to view and then does whatever it
wants with it. It also fits applications whose computation does
not require selecting big amounts of data; as the required data
is a small fraction of the total, it is affordable to fetch that
slice and process it locally. The same holds for systems using
libraries like snowfall [X], which lets the user tune the
parallelism of R functions locally, up to the point where
distribution requires heavier mechanisms like Hadoop or Spark.
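A minimal sketch of this kind of local parallelism with snowfall is shown below; the number of CPUs and the per-row function are hypothetical placeholders, not part of the ALOJA-ML code.

library(snowfall)
sfInit(parallel = TRUE, cpus = 4)        # spawn 4 local R workers (assumed core count)
sfExport("ds")                            # make the local data-set visible to the workers
# Apply a (hypothetical) per-row analytics function in parallel:
results <- sfLapply(seq_len(nrow(ds)), function(i) analyze_row(ds[i, ]))
sfStop()                                  # release the worker processes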
There are mechanisms and protocols, like ODBC or JDBC
(e.g. the RODBC library for R), that allow applications to access
DBs directly and fetch data without passing through the file
system, going through the application memory space instead.
This requires that the analytics application has been granted
direct access to the DB and understands the returned format.
Figure 1 shows both archetypical local schemes.
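As an illustration of the ODBC path, the sketch below uses the RODBC package to pull a slice of data, process it locally, and push results back; the DSN, table and column names are only illustrative (the column names follow the ALOJA examples shown later).

library(RODBC)
ch <- odbcConnect("aloja_dsn")            # DSN name is an assumption
ds <- sqlQuery(ch, "SELECT exe_time, maps, iofilebuf FROM aloja6")
# ... process "ds" with the analytics engine, e.g. ALOJA-ML functions ...
sqlSave(ch, ds, tablename = "aloja6_processed", append = TRUE, rownames = FALSE)
odbcClose(ch)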
Fig. 1. Schema of local execution approaches, passing data through the File System or ODBC mechanisms
In the first case of Figure 1 we assume a scenario where
the engine has no direct access to the DB, as everything that
goes into the analytics is triggered by the base application. Data
is retrieved and pre-processed, then piped to the analytics
engine through the file system (files, pipes, ...). This requires
coordination between application and engine in order to
communicate the data properly. Such scenarios can occur
when the DB, the application and the engine do not belong
to the same organization, or when they belong to services
from different providers. They can also occur when security
issues arise, e.g. the analytics are provided by a not-so-trusted
provider and data must be anonymized before leaving
the DB.
In turn, if the DB and the analytics engine are provided by the
same service provider, there will be means to connect the
storage with the engine through ODBC or other mechanisms,
allowing the engine to fetch the data, process it, and then return
the results to the DB. As an example, the Microsoft Azure-ML
services [15] allow connecting their machine learning components
to the Azure storage service, as well as including R code in
user-defined components. In this service the machine
learning operations do not happen on the storage service;
data is pulled into the engine and then processed.
At this moment the ALOJA project, which consists of a web user
interface (front-end and back-end), a MySQL database and
the R engine for analytics, uses this data pipe-line schema.
The web interface retrieves the data requested by the user from
the ALOJA database and displays it in the front-end. If
analytics are required, the back-end dumps the data to be
treated into the file system and then starts the R engine.
Results are collected by the back-end and displayed (and also
cached). As most of the information passes through
user filters, this data pipe-lining was preferred over connecting
the scripts directly to the DB. It also keeps the engine
independent of the DB queries, which are in constant development
in the main framework.
The next steps planned for the project are to incorporate
a version of the required analytics into the DB, so that the
back-end of the platform can invoke any of the provided analytics,
coded generically for any kind of input data, with a single SQL
query.
B. Distributed Environments
An alternative architecture for the project would be to
upload the data into a Distributed Database (DDB), with a
view to expanding the ALOJA database towards Big Data (we still
have Terabytes of data to be unpacked into the DB and
analyzed at this time). The data could be stored
using Hive or HBase technologies, and the analytics could be
adapted to Hadoop or Spark. For Hadoop, the RHadoop
packages could handle the analytics, while the SparkSQL +
SparkR packages could do the same for Spark. The analytics
could then be processed on a distributed system with
worker nodes, as most of the analytics in our platform can be
parallelized. Further, Hadoop and Spark have machine learning
libraries (Mahout and MLlib) that could be used natively
instead of some of the functionality of our ALOJA-ML framework.
Figure 2 shows the execution schema of this approach.
Fig. 2. Schema of distributed execution approach, with Distributed Databases or File Systems
Such an approach would concentrate all the data retrieval
and processing in a distributed system, not necessarily near
the application, but providing parallelism and agility in data
browsing, slicing and processing. However, it requires a
complex set-up of all the involved services and machines,
and the adaptation of the analytics towards parallelism at a
different level than when using snowfall or other packages.
SparkR provides a distributed data frame structure with
properties similar to regular R data frames, but due to the
distribution some classic operations are not available
or must be performed in a different way (e.g. column binding,
aggregates like “mean”, “count”, ...).
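As a hedged sketch (assuming a Spark 2.x deployment with SparkR), the fragment below shows how a local aggregate such as mean() and a column bind are expressed through SparkR verbs on a distributed data frame; "ds" and its columns follow the ALOJA examples and are only illustrative.

library(SparkR)
sparkR.session()                                    # connect to an existing Spark deployment
sdf <- as.DataFrame(ds)                             # distribute a local R data.frame
# mean(ds$exe_time) becomes an agg()/collect() pair on the distributed data frame:
avg_time <- collect(agg(sdf, avg_time = avg(sdf$exe_time)))
# cbind() is not available either; new columns are added with withColumn():
sdf <- withColumn(sdf, "exe_minutes", sdf$exe_time / 60)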
This option would be chosen at the point where the data
becomes too large to be handled by a single system, assuming
the DB managers do not provide means to distribute and retrieve
the data. Also, as in the previous approach, the data pipe-line
still relies on calling the analytics engine to produce the
queries and invoke the analytics (in a Map-Reduce manner)
when they are parallelizable, or to collect the data locally to
apply the non-parallelizable analytics. The price to be paid is
the set-up of the platform and the proper adjustment of the
analytics to a Map-Reduce schema.
C. Integrated Analytics
The second alternative to the local analytics schema is to
incorporate those functions into the DB services. The DB can
then offer Analytics as a Service: users can request analytics
directly from the DB in a SQL manner, and the data is
retrieved, processed and stored by the same framework, sparing
the user from planning a full data pipe-line.
As the presented services and products admit R scripts and
programs directly, no adaptation of the R code is required;
the effort must be put into coding the wrapping procedures so
they can be called from a SQL interface. Figure 3
shows the basic data-flow for this option.
Fig. 3. Schema of “In-DB” execution approach
The advantages of this approach are that 1) the analytics, the data
and the data management are integrated in the same software
package, and the only deployment and configuration needed is
that of the database; 2) the management of distributed data in Big
Data deployments is provided by the same DB software
(when this is implemented as part of the service!), and the
user does not need to implement Map-Reduce functions, as data
can be aggregated in the same SQL query; 3) simple DB users
can invoke the analytics through a SQL query, and the analytics
can be programmed generically so they can be applied to any
required input SQL query result; 4) DBs providing triggers can
produce Continuous Analytic Queries each time data is introduced
or modified.
This option still has issues to be managed, such as the capacity
for optimization and parallelism of the embedded scripts. In
the case of the Microsoft R Service, any R code can be inserted
inside a procedure, with no apparent optimization applied,
as the script is sent directly to an external R sub-service
managed by the main DB service. This R sub-service
promises multi-threading in the Enterprise editions; the
classic R engine is single-threaded, and multi-threading could
be applied to vectorization functions like “apply” without
having to load the previously mentioned snowfall (which remains
loadable, as the script runs on a compatible R engine). Also, comparing
this approach to the distributed computing systems, if the
service supported “partitioning” of data according to given
columns and values, SQL queries could be distributed across a
cluster of machines (as distribution, not replication)
and aggregated afterwards.
IV. ADAPTING AND EMBEDDING CODE
According to the published SQL-Server documentation and
API, the principal way to introduce external user-defined
scripts is to wrap the R code inside a procedure. A SQL-Server
procedure admits the script, which like any “Rscript” can
source external R files and load libraries (previously copied
into the corresponding R library path set up by the Microsoft
R Service); it also admits the parameters to be bridged to
the script, as well as SQL queries to be executed before the
script starts in order to fill its input data frames. The procedure
also defines the return values, in the form of data frames as
tables, single values or a tuple of values.
When the wrapping procedure is called, the input SQL queries
are executed and passed as input data-frame parameters, the direct
parameters are also passed to the script, and then the script is
executed in the R engine just as an “Rscript” or a script typed
into the R command line interface would be. The variables
mapped as outputs are returned from the procedure to the
invoking SQL query.
Figure 4 shows an example of calling the ALOJA-ML function
in charge of learning a linear model from the ALOJA data-set,
read from a file while indicating the input and output
variables, and then calling a function to predict a data-set using
the previously created model. The training process creates a
model and its hash ID, then the prediction process applies the
model to the whole testing data-set. In the current ALOJA data
pipe-line, “ds” should be retrieved from the DB (to a file to
be read, or directly to a variable if RODBC is used).
source("functions.r");
library(digest);
## Training Process
model <- aloja_linreg(ds = read.table("aloja6.csv"), vin =
c("maps","iofilebuf"), vout = "exe_time");
id_hash <- digest(x = model, algo = "md5"); # ID for storing the
model in DB
## Prediction Example
predictions <- aloja_predict_dataset(learned_model = model, ds
= read.table("aloja6test.csv"));
Fig. 4. Example of modeling and prediction using the ALOJA-ML libraries. Loading the ALOJA-ML functions allows the code to execute “aloja_linreg”
to model a linear regression, and “aloja_predict_dataset” to process a new data-set using a previously trained model.
Then the results (“model”, “id_hash” and “predictions”) should be
reintroduced into the DB, again using RODBC, or by writing the
results into a file and the model into a serialized R object file.
Figures 5 and 6 show how this R code is wrapped as a
procedure, and give examples of how these procedures are invoked.
Using this schema, in the modeling procedure “ds” is
passed to the procedure as a SQL query, “vin” and “vout”
are bridged parameters, and “model” and “id_hash”
are returned as two values (a serialized blob/string
and a string) that are inserted into a models table. The
prediction procedure in turn admits an “id_hash” for retrieving the
model (using a SQL query inside the procedure), and returns a
data-frame/table with the row IDs and the predictions.
We observed that, when preparing the embedded script, all
sources, libraries and file paths must be set up as for an Rscript
executed from the command line. The environment for the
script must be prepared every time, as each call is executed in a
new R session.
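As a sketch, the first lines of an embedded script typically repeat this setup on every call; the paths below are hypothetical and depend on where the R Services installation and the ALOJA-ML sources are placed.

# Each invocation starts a fresh R session, so paths and libraries are set explicitly.
.libPaths(c("C:/path/to/RService/library", .libPaths()))   # hypothetical library path
setwd("C:/path/to/aloja-ml")                                # directory with the framework sources
source("functions.r")                                        # ALOJA-ML functions
library(digest); library(base64enc)                          # libraries used by the wrappers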
In the usual work-flow of the ALOJA-ML tools, models (serialized
R objects) are stored in the file system and then
uploaded into the database as binary blobs. Working directly
from the DB server allows serialized objects to be encoded
directly into the available DB formats. Although SQL-Server
includes a “large binary” data type, we found problems when
returning binary information from procedures (the syntax does
not allow returning such a data type inside tuples), so serialized
objects can instead be converted to text formats like base64
and stored as a “variable size character array”.
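The workaround amounts to the round trip sketched below, which mirrors what the procedures in Figures 5 and 6 do with the model objects.

library(base64enc)
serial    <- serialize(model, NULL)          # serialized R object as a raw vector
model_txt <- base64encode(serial)            # text form, storable as nvarchar(max)/varchar
# ... store model_txt in the models table; later, after reading it back:
model_restored <- unserialize(base64decode(model_txt))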
V. EXPERIMENTS AND EXPERIENCES
A. Usability and Performance
For testing the approach we used the SQL-Server 2016
Basic version, with the R Service installed, on Windows
Server 2012, from a default image available in the Azure
repositories. Being the basic and not the enterprise edition,
we assumed that the R Services would not provide multi-threading
improvements, so additional packages for such functions
were installed (snowfall). The data from the ALOJA data-set was
imported through the CSV importing tools of the SQL-Server
Manager framework, the “Visual-Studio”-like integrated
development environment (IDE) for managing the services.
Data importing showed some problems concerning
data-type transformation, as some numeric columns
could not be properly imported due to precision and very large
values, and had to be treated as varchar (and thus not treatable
as numbers). After making sure the CSV data was properly
imported, the IDE allowed us to perform simple queries (SELECT,
INSERT, UPDATE). At this point, tables for storing
key-value entries, containing the models with their specific
ID hash as key, were created. Procedures wrapping the basic
available functions of ALOJA-ML were created, as specified
in Section IV. After executing some example calls,
the system worked as expected.
During the process of creating the wrapping procedures
we found some issues, probably unreported bugs or internal
system limitations: for instance, a procedure can return a
large binary type (the large blob of MySQL and similar solutions)
and can return tuples of diverse data types, but it
crashed with an internal error when trying to return a tuple
of a varchar and a large binary. A workaround was found
by converting the serialized model object (binary type) into
a base64-encoded string (varchar type), to be stored with
its ID hash key. As no information about this issue was
found in the documentation at the time of recording these
experiences, we expect that such issues will be solved by the
development team in the future.
We initially ran some tests using the ALOJA-ML modeling and
prediction functions over the data; comparing
times with a local “vanilla” R setup, performance was almost
identical. This indicates that, in this “basic” version, the R
Server (formerly the Revolution Analytics R) behaves essentially
like the standard engine at this point.
Another test was to run the outlier classification
function of our framework. The function, explained in detail
in its corresponding work [13], compares the original output
variables with their predictions; if the difference is more than k
times the expected standard deviation plus the modeling error,
and the value does not have enough support from similar values
in the rest of the data-set, the data point is considered an outlier.
This implies constantly reading the data table for prediction
and for comparisons between entries. The performance results
were similar to those in a local execution environment,
measuring only the time spent in the function.
-- Creation of the training procedure, wrapping the R call
CREATE PROCEDURE dbo.MLPredictTrain @inquery nvarchar(max),
    @varin nvarchar(max), @varout nvarchar(max) AS
BEGIN
EXECUTE sp_execute_external_script
    @language = N'R',
    @script = N'
        source("functions.r");
        library(digest); library(base64enc);
        model <- aloja_linreg(ds = InputDataSet, vin = unlist(strsplit(vin, ",")), vout = vout);
        serial <- as.raw(serialize(model, NULL));
        OutputDataSet <- data.frame(model = base64encode(serial), id_hash = digest(serial, algo = "md5"));
    ',
    @input_data_1 = @inquery,
    @input_data_1_name = N'InputDataSet',
    @output_data_1_name = N'OutputDataSet',
    @params = N'@vin nvarchar(max), @vout nvarchar(max)',
    @vin = @varin,
    @vout = @varout
WITH RESULT SETS (("model" nvarchar(max), "id_hash" nvarchar(50)));
END

-- Example of creating a model and storing it into the DB
INSERT INTO aloja.dbo.trained_models (model, id_hash)
EXEC dbo.MLPredictTrain
    @inquery = 'SELECT exe_time, maps, iofilebuf FROM aloja.dbo.aloja6',
    @varin = 'maps,iofilebuf', @varout = 'exe_time';
Fig. 5. Version of the modeling call for ALOJA-ML functions in a SQL-Server procedure. The procedure generates the data-set for “aloja_linreg” from a
parametrized query and bridges the rest of the parameters into the script. It also indicates the format of the output, which can be a value, a tuple or a table (data frame).
-- Creation of the predicting procedure, wrapping the R call
CREATE PROCEDURE dbo.MLPredict @inquery nvarchar(max),
    @id_hash nvarchar(max) AS
BEGIN
DECLARE @modelt nvarchar(max) = (SELECT TOP 1 model
                                 FROM aloja.dbo.trained_models
                                 WHERE id_hash = @id_hash);
EXECUTE sp_execute_external_script
    @language = N'R',
    @script = N'
        source("functions.r");
        library(base64enc);
        results <- aloja_predict_dataset(learned_model = unserialize(as.raw(base64decode(model))), ds = InputDataSet);
        OutputDataSet <- data.frame(results);
    ',
    @input_data_1 = @inquery,
    @input_data_1_name = N'InputDataSet',
    @output_data_1_name = N'OutputDataSet',
    @params = N'@model nvarchar(max)',
    @model = @modelt;
END

-- Example of predicting a data-set from a SQL query with a previously trained model in the DB
EXEC aloja.dbo.MLPredict
    @inquery = 'SELECT exe_time, maps, iofilebuf FROM aloja.dbo.aloja6test',
    @id_hash = 'aa0279e9d32a2858ade992ab1de8f82e';
Fig. 6. Version of the prediction call for ALOJA-ML functions in a SQL-Server procedure. Like the training procedure in Figure 5, the procedure
retrieves the data to be processed from a SQL query and passes it, with the rest of the parameters, into the script. Here the result is directly a table (data frame).
In an HDI-A2 instance (2 virtual cores, only 1 used, 3.5GB memory),
it took 1h:9m:56s to process the 33147 rows, selecting just 3
features. Then, to improve performance given the limitations
of the single-threaded R Server, we loaded snowfall and invoked
it from the ALOJA-ML “outlier_dataset” function, on an HDI-A8
instance (8 virtual cores, all used, 14GB memory). The data-set
was processed in 11m:4s, barely 1/7 of the previous time even
with the overhead of sharing data among the R processes created
by snowfall, demonstrating that, despite not being a multi-threaded
set-up, R procedures can be scaled using the traditional
resources available in R.
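For reference, the core of the outlier rule described above can be sketched as follows; k, the modeling error and the support check are parameters of the real ALOJA-ML function, so this is only an approximation of its logic.

# Hedged sketch of the outlier rule (not the actual ALOJA-ML implementation):
resid      <- abs(ds$exe_time - predictions)           # deviation from the model
threshold  <- k * (sd(ds$exe_time) + model_error)      # k and model_error are parameters
candidates <- resid > threshold                        # flagged entries, still pending the
                                                       # support check against similar rows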
B. Discussion
One of the concerns about the usage of such a service is, despite
and because of the capability of multi-processing through built-in
or loaded libraries, the management of the pool of R
processes. R is not just a scripting language to be embedded
in a procedure; it is a high-level language that allows anything
from creating system calls to parallelizing work among networked
worker nodes. Given the complexity that a user-created R
function can reach, in cases where such a procedure is
heavily requested, the R server should be able to be detached
from the SQL-Server and run in dedicated HPC deployments.
In the same way snowfall can be deployed for multi-threading
(and for cluster computing), clever hacks can be created by
loading RHadoop or SparkR inside a procedure, connecting
the script with a distributed processing system. As the SQL-Server
bridges tables and query results as R data frames, such
data frames can be converted to Spark’s Resilient Distributed
Data-sets or Distributed Data Frames, uploaded to
HDFS, processed, and then returned to the database. This could
lead to a new architecture where SQL-Server (or equivalent
solutions) connects to distributed processing environments
acting as slave HPC workers for the database. A further improvement
would be for the DB server itself, instead of producing input
tables/data frames, to directly return Distributed Data Frames,
with the database distributed across worker nodes (partitioned,
not replicated). All in all, the fact
that the embedded R code is passed directly to a nearly
independent R engine allows a data scientist to do whatever
could be done in a typical R session.
VI. SUMMARY AND CONCLUSIONS
The incorporation of R, the semi-functional statistical programming
language, as an embedded analytics service in
databases will improve the ease and usability of analytics over
any kind of data, from regular volumes to big amounts (Big Data).
The ability of data scientists to introduce their analytics
functions as procedures in the database, avoiding complex
data-flows from the DB to analytics engines, gives users and
experts a quick tool for treating data in-situ and continuously.
This study discussed different architectures for data
processing, involving fetching data from DBs, distributing data
and processing power, and embedding the data processing in the
DB, all of this using the ALOJA-ML framework as reference:
a framework written in R dedicated to modeling, predicting and
classifying data from Hadoop executions, stored as the ALOJA
data-set. The examples and use cases shown correspond
to the port of the current ALOJA architecture towards SQL-Server
2016 with its integrated R Services.
After testing the analytics functions once ported into
the SQL database, we observed that the major effort of the
porting lies in the wrapping SQL structures that incorporate
the R calls into the DB, without modifying the original R code.
As performance is similar to that of standalone R distributions,
the advantages come from the retrieval and storage of the input data.
In future work we plan to test the system more in depth and
to compare different SQL+R solutions, as companies offering
DB products have started putting effort into integrating
R engines in their DB platforms. As this study focused on
an initial hands-on experience with this new technology, future studies
will focus more on comparing performance, also against other
architectures for processing Big Data.
ACKNOWLEDGMENTS
This project has received funding from the European Research
Council (ERC) under the European Union’s Horizon 2020
research
and innovation programme (grant agreement No 639595).
REFERENCES
[1] Apache Hadoop. http://hadoop.apache.org (Aug 2016).
[2] Apache HBase. https://hbase.apache.org/ (Aug 2016).
[3] Apache Hive. https://hive.apache.org/ (Aug 2016).
[4] Apache Spark. https://spark.apache.org/ (Aug 2016).
[5] Databricks inc. https://databricks.com/ (Aug 2016).
[6] ParStream. Cisco corporation. http://www.cisco.com/c/en/us/products/analytics-automation-software/parstream/index.html (Aug 2016).
[7] PureData Systems for Analytics. IBM corporation. https://www-01.ibm.com/software/data/puredata/analytics/ (Aug 2016).
[8] R-Server. Microsoft corporation. https://www.microsoft.com/en-us/cloud-platform/r-server (Aug 2016).
[9] RHadoop. Revolution Analytics. https://github.com/RevolutionAnalytics/RHadoop/wiki (Aug 2016).
[10] SQL-Server 2016. Microsoft corporation. https://www.microsoft.com/en-us/cloud-platform/sql-server (Aug 2016).
[11] UC Berkeley, AMPlab. https://amplab.cs.berkeley.edu/ (Aug 2016).
[12] Barcelona Supercomputing Center. ALOJA home page. http://aloja.bsc.es/ (Aug 2016).
[13] J. L. Berral, N. Poggi, D. Carrera, A. Call, R. Reinauer, and D. Green. ALOJA-ML: A framework for automating characterization and knowledge discovery in hadoop deployments. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015, pages 1701–1710, 2015.
[14] D. Borthakur. The Hadoop Distributed File System: Architecture and Design. http://hadoop.apache.org/docs/r0.18.0/hdfs_design.pdf. The Apache Software Foundation, 2007.
[15] Microsoft Corporation. Azure 4 Research. http://research.microsoft.com/en-us/projects/azure/default.aspx (Jan 2016).
[16] N. Poggi, J. L. Berral, D. Carrera, A. Call, F. Gagliardi, R. Reinauer, N. Vujic, D. Green, and J. A. Blakeley. From performance profiling to predictive analytics while evaluating hadoop cost-efficiency in ALOJA. In 2015 IEEE International Conference on Big Data, Big Data 2015, Santa Clara, CA, USA, October 29 - November 1, 2015, pages 1220–1229, 2015.
[17] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2014.
Managing Big Data Analytics Workflows
with a Database System
Carlos Ordonez, University of Houston, USA
Javier García-García, UNAM University, Mexico
Abstract—A big data analytics workflow is long and complex,
with many programs, tools and scripts interacting together. In
general, in modern organizations there is a significant amount
of big data analytics processing performed outside a database
system, which creates many issues in managing and processing big
data analytics workflows. In general, data preprocessing is the
most time-consuming task in a big data analytics workflow. In
this work, we defend the idea of preprocessing, computing
models
and scoring data sets inside a database system. In addition, we
discuss recommendations and experiences to improve big data
analytics workflows by pushing data preprocessing (i.e. data
cleaning, aggregation and column transformation) into a
database
system. We present a discussion of practical issues and common
solutions when transforming and preparing data sets to improve
big data analytics workflows. As a case study validation, based
on
experience from real-life big data analytics projects, we
compare
pros and cons of running big data analytics workflows
inside and outside the database system. We highlight which
tasks
in a big data analytics workflow are easier to manage and faster
when processed by the database system, compared to external
processing.
I. INTRODUCTION
In a modern organization, transaction processing [4] and data
warehousing [6] are managed by database systems. On the
other hand, big data analytics projects are a different story.
Despite the existing data mining [5], [6] functionality offered by
most database systems, many statistical tasks are generally
performed outside the database system [7], [6], [15]. This
is due to the existence of sophisticated statistical tools and
libraries, the lack of expertise of statistical analysts to write
correct and efficient SQL queries, a somewhat limited set of
statistical algorithms and techniques in the database system
(compared to statistical packages), and the abundance of legacy
code (generally well tuned and difficult to rewrite in a different
language). In such a scenario, users exploit the database system
just to extract data with ad-hoc SQL queries. Once large tables
are exported, they are summarized and further transformed
depending on the task at hand, but outside the database system.
Finally, when the data set has the desired variables (features),
statistical models are computed, interpreted and tuned. In
general, preprocessing data for analysis is the most time-consuming
task [5], [16], [17] because it requires cleaning,
joining, aggregating and transforming files and tables to obtain
“analytic” data sets appropriate for cube or machine learning
processing. Unfortunately, in such environments manipulating
data sets outside the database system creates many data management
issues: data sets must be recreated and re-exported
every time there is a change, models need to be imported
back into the database system and then deployed, different
users may have inconsistent versions of the same data set on
their own computers, and security is compromised. Therefore, we
defend the thesis that it is better to transform data sets and
compute statistical models inside the database system. We
provide evidence that such an approach improves the management and
processing of big data analytics workflows. We argue that database
system-centric big data analytics workflows are a promising
processing approach for large organizations.
The experiences reported in this work are based on analytics
projects in which the first author participated as developer
and consultant at a major DBMS company. Based on experience
from big data analytics projects, we have become
aware that statistical analysts cannot write correct and efficient
SQL code to extract data from the database system or to
transform their data sets for the big data analytics task at hand.
To overcome such limitations, statistical analysts use big data
analytics tools, statistical software and data mining packages to
manipulate and transform their data sets. Consequently, users
end up creating a complicated collection of programs mixing
data transformation and statistical modeling tasks together. In
a typical big data analytics workflow most of the development
(programming) effort is spent on transforming the data set: this
is the main aspect studied in this article. The data set represents
the core element of a big data analytics workflow: all data
transformation and modeling tasks must work on the entire
data set. We present evidence that running workflows entirely
inside a database system improves workflow management and
accelerates workflow processing.
The article is organized as follows. Section II provides
the specification of the most common big data analytics
workflows. Section III discusses practical issues and solutions
when preparing and transforming data sets inside a database
system. We contrast processing workflows inside and outside
the database system. Section IV compares advantages and
time performance of processing big data analytics workflows
inside the database system and outside it with external
data mining tools. Such comparisons are based on experience
from real-life data mining projects. Section V discusses related
work. Section VI presents conclusions and research issues for
future work.
II. BIG DATA ANALYTICS WORKFLOWS
We consider big data analytics workflows in which Task
i must be finished before Task i + 1. Processing is mostly
sequential from task to task, but it is feasible to have cycles
from Task i back to Task 1 because data mining is an iterative
process. We distinguish two complementary big data analytics
workflows: (1) computing statistical models; (2) scoring data
sets based on a model. In general, statistical models are
computed on a new data set, whereas scoring takes place after
the model is well understood and an acceptable model
has been produced.
A typical big data analytics workflow to compute a model
consists of the following major tasks (this basic workflow can
vary depending on the users’ knowledge and focus):
1) Data preprocessing: Building data sets for analysis (se-
lecting records, denormalization and aggregation)
2) Exploratory analysis (descriptive statistics, OLAP)
3) Computing statistical models
4) Tuning models: add, modify data set attributes, fea-
ture/variable selection, train/test models
5) Create scoring scripts based on selected features and best
model
This modeling workflow tends to be linear (Task i executed
before Task i+1), but in general it has cycles. This is because
there are many iterations of building data sets, analyzing variable
behavior and computing models before an acceptable model can
be obtained. Initially the bottleneck of the workflow is data
preprocessing. Later, as the data mining project evolves,
most workflow processing is testing and tuning the desired
model. In general, the database system performs some denormalization
and aggregation, but it is common to perform most
transformations outside the database system. When features
need to be added or modified, queries need to be recomputed,
going back to Task 1. Therefore, only a part of the first task
of the workflow is computed inside the database system; the
remaining stages of the workflow are computed with external
data mining tools.
The second workflow we consider is actually deploying a
model with arriving (new) data. This task is called scoring [7],
[12]. Since the best features and the best model are already
known, this is generally processed inside the database system.
A scoring workflow is linear and non-iterative: Processing
starts with Task 1, Task i is executed before Task i + 1 and
processing ends with the last task. In a scoring workflow tasks
are as follows:
1) Data preprocessing: Building data set for scoring (select
records, compute best features).
2) Deploy model on the test data set (e.g. predict class or target
value), store model output per record as a new table or
view in the database system.
3) Evaluate queries and produce summary reports on scored
data sets joined with source tables. Explain models by
tracing data back to source tables in the database system.
Building the data set tends to be the most important task in both
workflows since it requires gathering data from many source
tables. In this work, we defend the idea of preprocessing big
data inside a database system. This approach makes sense only
when the raw data originates from a data warehouse. That
is, we do not tackle the problem of migrating processing of
external data sets into the database system.
III. TASKS TO PREPARE DATA SETS
We discuss issues and recommendations to improve the data
processing stage in a big data analytics workflow. Our main
motivation is to bypass the use of external tools to perform
data preprocessing, using the SQL language instead.
A. Reasons for Processing Big Data Analytics Workflows Outside the Database System
In general, the main reason is of a practical nature: users
do not want to translate existing code working on external
tools. Commonly such code has existed for a long time
(legacy programs), it is extensive (there exist many programs)
and it has been debugged and tuned. Therefore, users are
reluctant to rewrite it in a different language, given associated
risks. A second complaint is that, in general, the database
system provides elementary statistical functionality, compared
to sophisticated statistical packages. Nevertheless, such gap
has been shrinking over the years. As explained before, in a
data mining environment most user time is spent on preparing
data sets for analytic purposes. We discuss some of the most
important database issues when tables are manipulated and
transformed to prepare a data set for data mining or statistical
analysis.
B. Main Data Preprocessing Tasks
We consider three major tasks:
1) selecting records for analysis (filtering)
2) denormalization (including math transformation)
3) aggregation
Selecting Records for Analysis: Selection of rows can be
done at several stages, in different tables. Such filtering is done
to discard outliers, to discard records with a significant missing
information content (including referential integrity [14]), to
discard records whose potential contribution to the model
provides no insight or sets of records whose characteristics
deserve separate analysis. It is well known that pushing
selection is the basic strategy to accelerate SPJ (select-project-
join) queries, but it is not straightforward to apply across multiple
queries. A common solution we have used is to perform as
much filtering as possible on one data set. This makes code
maintenance easier and the query optimizer is able to exploit
filtering predicates as much as possible.
In general, for a data mining task it is necessary to select
a set of records from one of the largest tables in the database
based on a date range. In general, this selection requires a
scan on the entire table, which is slow. When there is a
secondary index based on date it is more efficient to select
rows, but it is not always available. The basic issue is that such
transaction table is much larger compared to other tables in the
database. For instance, this time window defines a set of active
customers or bank accounts that have recent activity. Common
solutions to this problem include creating materialized views
(avoiding join recomputation on large tables) and lean tables
with primary keys of the object being analyzed (record,
product, etc.) to act as a filter in further data preparation tasks.
Denormalization: In general, it is necessary to gather data
from many tables and store data elements in one place.
It is well-known that on-line transaction processing (OLTP)
database systems update normalized tables. Normalization
makes transaction processing faster and ACID [4] semantics
are easier to ensure. Queries that retrieve a few records
from normalized tables are relatively fast. On the other hand,
analysis on the database requires precisely the opposite: a large
set of records is required and such records gather information
from many tables. Such processing typically involves complex
queries involving joins and aggregations. Therefore, normalization
works against efficiently building data sets for analytic
purposes. One solution is to keep a few key denormalized
tables from which specialized tables can be built. In general,
such tables cannot be dynamically maintained because they
involve join computation with large tables. Therefore, they
are periodically recreated as a batch process. The query
shown below builds a denormalized table from which several
aggregations can be computed (e.g. similar to cube queries).
SELECT
customer_id
,customer_name
,product.product_id
,product_name
,department.department_id
,department_name
FROM sales
JOIN product
ON sales.product_id=product.product_id
JOIN department
ON product.department_id
=department.department_id;
For analytic purposes it is always best to use as much
data as possible. There are strong reasons for this: statistical
models are more reliable, and it is easier to deal with missing
information and skewed distributions, discover outliers and so on,
when there is a large data set at hand. In a large database,
joining tables coming from a normalized schema with tables
used in the past for analytic purposes may involve
joins with records whose foreign keys are not found
in some table. That is, natural joins may discard potentially
useful records. The net effect of this issue is that the resulting
data set does not include all potential objects (e.g. records,
products). The solution is to define a universe data set containing
all objects, gathered with a union over all tables, and then use
this table as the fundamental table on which to perform outer joins.
For simplicity and elegance, left outer joins are preferred. Left
outer joins are then propagated everywhere in data preparation,
and completeness of records is guaranteed. In general, such
left outer joins have a “star” form in the joining conditions,
where the primary key of the master table is left joined with
the primary keys of the other tables, instead of joining them
with chained conditions (the FK of table T1 is joined with the PK
of table T2, the FK of table T2 is joined with the PK of T3, and so
on). The query below computes a global data set built from
individual data sets. Notice that the data sets may or may not overlap
each other.
SELECT
record_id
,T1.A1
,T2.A2
..
,Tk.Ak
FROM T_0
LEFT JOIN T1 ON T_0.record_id= T1.record_id
LEFT JOIN T2 ON T_0.record_id= T2.record_id
..
LEFT JOIN Tk ON T_0.record_id= Tk.record_id;
Aggregation: Unfortunately, most data mining tasks require
dimensions (variables) that are not readily available from
the database. Such dimensions typically require computing
aggregations at several granularity levels. This is because most
columns required by statistical or machine learning techniques
require measures (or metrics), which translate as sums or
counts computed with SQL. Unfortunately, granularity levels
are not hierarchical (like cubes or OLAP [4]) making the use
of separate summary tables necessary (e.g. summarization by
product or by customer, in a retail database). A straightforward
optimization is to compute as many dimensions as possible in the
same statement, exploiting the same group-by clause.
In general, for a statistical analyst it is best to create as many
variables (dimensions) as possible in order to isolate those
that can help build a more accurate model. Then summarization
tends to create tables with hundreds of columns, which
make query processing slower. However, most state-of-the-
art statistical and machine learning techniques are designed
to perform variable (feature) selection [7], [10] and many of
those columns end up being discarded.
A typical query to derive dimensions from a transaction
table is as follows:
SELECT
customer_id
,count(*) AS cntItems
,sum(salesAmt) AS totalSales
,sum(case when salesAmt<0 then 1 end)
AS cntReturns
FROM sales
GROUP BY customer_id;
Transaction tables generally have two or even more levels
of detail, sharing some columns in their primary key. The
typical example is a store transaction table with individual
items and the total count of items and total amount paid.
This means that many times it is not possible to perform a
statistical analysis only from one table. There may be unique
pieces of information at each level. Therefore, such large
tables need to be joined with each other and then aggregated
at the appropriate granularity level, depending on the data
mining task at hand. In general, such queries are optimized
by indexing both tables on their common columns so that
hash-joins can be used.
Combining Denormalization and Aggregation, Multiple Primary
Keys: A final aspect is how aggregation and denormalization
interact when there are multiple primary keys. In a large
database, different subsets of tables have different primary keys.
In other words, such tables are not compatible with each other
for further aggregation.
The key issue is that at some point large tables with different
primary keys must be joined and summarized. Join operations
will be slow because indexing involves foreign keys with large
cardinalities. Two solutions are common: creating a secondary
index on the alternative primary key of the largest table, or
creating a denormalized table having both primary keys in
order to enable fast join processing.
For instance, consider a data mining project in a bank that
requires analysis by customer id, but also account id. One
customer may have multiple accounts. An account may have
multiple account holders. Joining and manipulating such tables
is challenging given their sizes.
C. Workflow Processing
Dependent SQL statements: A data transformation script is
a long sequence of SELECT statements. Their dependencies
are complex, although there exists a partial order defined by
the order in which temporary tables and data sets for analysis
are created. To debug SQL code it is a bad idea to create
a single query with multiple query blocks. In other words,
such SQL statements are not amenable to the query optimizer
because they are separate, unless it can keep track of historic
usage patterns of queries. A common solution is to create
intermediate tables that can be shared by several statements.
Those intermediate tables commonly have columns that can
later be aggregated at the appropriate granularity levels.
Computer resource usage: This aspect includes both disk
and CPU usage, with the second one being a more valuable
resource. This problem gets compounded by the fact that
most data mining tasks work on the entire data set or large
subsets from it. In an active database environment running
data preparation tasks during peak usage hours can degrade
performance since, generally speaking, large tables are read
and large tables are created. Therefore, it is necessary to
use workload management tools to optimize queries from
several users together. In general, the solution is to give data
preparation tasks a lower priority than the priority for queries
from interactive users. As a longer-term strategy, it is best
to organize data mining projects around common data sets,
but this goal is difficult to reach given the mathematical
nature of analysis and the ever changing nature of variables
(dimensions) in the data sets.
Comparing views and temporary tables: Views provide
limited control on storage and indexing. It may be better
to create temporary tables, especially when there are many
primary keys used in summarization. Nevertheless, disk space
usage grows fast and such tables/views need to be refreshed
when new records are inserted or new variables (dimensions)
are created.
Scoring: Even though many models are built outside the
database system with statistical packages and data mining
tools, in the end the model must be applied inside the database
system [6]. When volumes of data are not large it is feasible
to perform model deployment outside: exporting data sets,
applying the model and building reports can be done in no
more than a few minutes. However, as data volume increases
exporting data from the database system becomes a bottleneck.
This problem gets compounded with results interpretation
when it is necessary to relate statistical numbers back to the
original tables in the database. Therefore, it is common to build
models outside, frequently based on samples, and then once
an acceptable model is obtained, then it is imported back into
the database system. Nowadays, model deployment basically
happens in two ways: using SQL queries if the mathematical
computations are relatively simple or with UDFs [2], [12] if
the computations are more sophisticated. In most cases, such
scoring process can work in a single table scan, providing
good performance.
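As a hedged illustration of the SQL-query route (not code from the projects described here; the coefficients, table and column names are made up), a logistic-regression score can be computed for every row in a single scan of the data set table:

    -- Score = 1 / (1 + exp(-(b0 + b1*x1 + b2*x2 + b3*x3))), hard-coded
    -- hypothetical coefficients; one pass over the data set table.
    SELECT customer_id,
           1.0 / (1.0 + EXP(-(-2.1
                              + 0.80 * monthly_spend
                              + 0.05 * tenure_months
                              - 1.30 * num_complaints))) AS churn_score
    FROM customer_dataset;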
IV. COMPARING PROCESSING OF WORKFLOWS INSIDE
AND OUTSIDE A DATABASE SYSTEM
In this section we present a qualitative and quantitative
comparison between running big data analytics workflows
inside and outside the database system. This discussion is
a summary of representative successful projects. We first
discuss a typical database environment. Second, we present a
summary of the data mining projects, presenting their practical
application and the statistical and data mining techniques
used. Third, we discuss advantages and accomplishments for
each workflow running inside the database system, as well
as the main objections or concerns against such an approach
(i.e. migrating external transformation code into the database
system). Fourth, we present time measurements taken from
actual projects at each organization, running big data analytics
workflows completely inside and partially outside the database
system.
A. Data Warehousing Environment
The environment was a data warehouse, where several
databases were already integrated into a large enterprise-wide
database. The database server was surrounded by specialized
servers performing OLAP and statistical analysis. One of those
servers was a statistical server with a fast network connection
to the database server.
First of all, an entire set of statistical-language programs
was translated into SQL using the Teradata data mining
program, its translator tool and customized SQL code. Second,
in every case the data sets were verified to have the same
contents in the statistical language and in SQL. In most cases,
the numeric output from statistical and machine learning models
was the same, but sometimes there were slight numeric
differences, due to variations in algorithmic improvements and
advanced parameters (e.g. epsilon for convergence, step-wise
regression procedures, pruning methods in decision trees, and
so on).
B. Organizations: statistical models and business application
We now briefly discuss the organizations where the statistical
code migration took place, along with the specific data mining
techniques used in each case. Due to privacy concerns we omit
specific information about each organization, their databases
and the hardware configuration of their database system
servers. The big data analytics workflows were executed on the
organizations' database servers, concurrently with other users
(analysts, managers, DBAs, and so on).
We now describe the computers processing the workflow
in more detail. The database system server was, in general,
a parallel multiprocessor computer with a large number of
CPUs, ample memory per CPU and several terabytes of
parallel disk storage in high-performance RAID configurations.
The statistical server, on the other hand, was generally a smaller
computer with less than 500 GB of disk space but ample
memory. Statistical and data mining analysis inside the
database system was performed only with SQL. In general, a
workstation was connected to each server with the appropriate
client utilities; the connection to the database system used
ODBC. All time measurements discussed herein were taken
on 32-bit CPUs over the course of several years. Therefore,
they cannot be compared with each other and should only be
used to understand performance gains within the same
organization.
The first organization was an insurance company. The
data mining goal involved segmenting customers into tiers
according to their profitability based on demographic data,
billing information and claims. The statistical techniques used
to determine segments involved histograms and clustering.
The final data set had about n = 300k records and d = 25
variables. There were four segments, categorizing customers
from best to worst.
The second organization was a cellular telephone service
provider. The data mining task involved predicting which
customers were likely to upgrade their call service package
or purchase a new handset. The default technique was logistic
regression [7] with a stepwise procedure for variable selection.
The data set used for scoring had about n = 10M records and
d = 120 variables. The predicted variable was binary.
The third organization was an Internet Service Provider
(ISP). The predictive task was to detect which customers were
likely to disconnect service within a time window of a few
months, based on their demographic data, billing information
and service usage. The statistical techniques used in this case
were decision trees and logistic regression and the predicted
variable was binary. The final data set had n = 3.5M records
and d = 50 variables.
C. Database system-centric big data analytics workflows:
Users' Opinion
We summarize pros and cons of running big data analytics
workflows inside and outside the database system; Table I
contains a summary of outcomes. As the table shows, faster
scoring of data sets and easier data set transformation are
positive outcomes in every case. Building models faster turned
out not to be as important because users relied on sampling to
build models, and several samples were collected to tune and
test them. Since all databases and servers were within the
same firewall, security was not a major concern. In general,
improving data management was not seen as a major concern
because a data warehouse already existed, but users acknowledged
that a “distributed” analytic environment could become a
management issue. We now summarize the main objections,
despite the advantages discussed above. We exclude cost as a
decisive factor in order to preserve the anonymity of users'
opinions and give an unbiased discussion. First, many users
preferred a traditional programming language like Java or C++
over a set-oriented language like SQL. Second, some specialized
techniques are not available in the database system due to their
mathematical complexity; relevant examples include Support
Vector Machines, non-linear regression and time series models.
Finally, sampling is a standard mechanism to analyze large
data sets.
D. Workflow Processing Time Comparison
We compare workflow processing time inside the database
system, using SQL and UDFs, and outside the database system,
using an external server that transformed exported data sets and
computed models. In general, the external server ran existing
data transformation programs developed by each organization,
and each organization had diverse data mining tools that
analyzed flat files. We must mention that the comparison is not
entirely fair because the database system server was in general
a powerful parallel computer while the external server was a
smaller machine. Nevertheless, such a setting represents a
common IT environment, where the fastest computer is precisely
the database system server.
We now discuss the database tables in more detail. Several
input tables coming from a large normalized database were
transformed and denormalized to build the data sets used by
statistical or machine learning techniques. In short, both the
inputs and the outputs were tables; no data sets were exported,
and all processing happened inside the database system. On the
other hand, analysis on the external server relied on SQL
queries to extract data from the database system, transforming
the data to produce data sets on the statistical server and then
building models or scoring data sets based on a model. In
general, data extraction from the database system was performed
with bulk utilities that exported data records in blocks. Clearly,
there was an export bottleneck from the database system to the
external server.
Table II compares performance between both workflow processing
alternatives: inside and outside the database system.
TABLE I
ADVANTAGES AND DISADVANTAGES ON RUNNING BIG DATA ANALYTICS WORKFLOWS INSIDE A DATABASE SYSTEM.
Insur. Phone ISP
Advantages:
Improve Workflow Management X X X
Accelerate Workflow Execution X
Prepare data sets more easily X X X
Compute models faster without sampling X
Score data sets faster X X X
Enforce database security X X
Disadvantages:
Workflow output independent X X
Workflow outside database system OK X X
Prefer program. lang. over SQL X X
Samples accelerate modeling X X
Database system lacks statistical models X
Legacy code X
TABLE II
WORKFLOW PROCESSING TIME INSIDE AND OUTSIDE A DATABASE SYSTEM (TIME IN MINUTES).

    Task                  outside   inside
    Build model:
      Segmentation              2        1
      Predict propensity       38        8
      Predict churn           120       20
    Score data set:
      Segmentation              5        1
      Predict propensity      150        2
      Predict churn            10        1
TABLE III
TIME TO COMPUTE MODELS INSIDE THE DATABASE SYSTEM AND TIME TO EXPORT DATA SET (SECS).

    n × 1000    d   SQL/UDF    BULK    ODBC
         100    8         4      17     168
         100   16         5      32     311
         100   32         6      63     615
         100   64         8     121    1204
        1000    8        40     164    1690
        1000   16        51     319    3112
        1000   32        62     622    6160
        1000   64        78    1188   12010
As introduced in Section II, we distinguish two big data analytics
workflows: computing the model and scoring the model on
large data sets. The times shown in Table II include the time
to transform the data set with joins and aggregations, the time
to compute the model and the time to score by applying the best
model. As we can see, the database system is significantly
faster. We must mention that to build the predictive models
both approaches exploited samples from a large data set; the
models were then tuned with further samples. For scoring data
sets the gap is wider, highlighting the efficiency of SQL both to
compute the joins and aggregations that build the data set and
to compute the mathematical equations on it. In general, the
main reason the external server was slower was the time to
export data from the database system; a secondary reason was
its more limited computing power.
Table III compares the processing time to compute a model
inside the database system with the time to export the data
set (with a bulk utility and with ODBC). In this case the database
system ran on a relatively small computer with a 3.2 GHz CPU, 4
GB of memory and 1 TB of disk. The models include PCA,
Naive Bayes and linear regression, all of which can be derived
from the correlation matrix [7] of the data set in a single table
scan using SQL queries and UDFs. Exporting the data set is
a bottleneck for data mining processing outside the database
system, regardless of how fast the external server is. Exporting
a sample can be done quickly, but analyzing a large data set
without sampling is much faster inside the database system;
moreover, sampling can also be exploited inside the database
system.
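To make the single-scan computation concrete, the following generic SQL sketch (a hypothetical table ds with numeric columns x1, x2, x3) gathers the count, the column sums and the cross-product sums in one pass, in the spirit of the summaries used in [12]; the correlation matrix, and from it PCA or linear regression, can then be derived from these few aggregates.

    -- One table scan producing n, the sums L and the cross-product sums Q.
    SELECT COUNT(*)     AS n,
           SUM(x1)      AS l1,  SUM(x2)      AS l2,  SUM(x3)      AS l3,
           SUM(x1 * x1) AS q11, SUM(x1 * x2) AS q12, SUM(x1 * x3) AS q13,
           SUM(x2 * x2) AS q22, SUM(x2 * x3) AS q23, SUM(x3 * x3) AS q33
    FROM ds;
    -- corr(xi, xj) = (n*qij - li*lj) / SQRT((n*qii - li*li) * (n*qjj - lj*lj))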
V. RELATED WORK
There exist many proposals that extend SQL with data
mining functionality. Most proposals add syntax to SQL
and optimize queries using the proposed extensions. Several
techniques to execute aggregate UDFs in parallel are studied
in [9]; these ideas are currently used by modern parallel
relational database systems such as Teradata. SQL extensions
to define, query and deploy data mining models are proposed
in [11]; such extensions provide a friendly language interface
to manage data mining models. This proposal focuses on man-
aging models rather than computing them and therefore such
extensions are complementary to UDFs. Query optimization
techniques and a simple SQL syntax extension to compute
multidimensional histograms are proposed in [8], where a
multiple grouping clause is optimized.
Some related work on exploiting SQL for data manipulation
tasks includes the following. Data mining primitive operators
are proposed in [1], including an operator to pivot a table
and another one for sampling, useful to build data sets. The
pivot/unpivot operators are extremely useful to transpose and
transform data sets for data mining and OLAP tasks [3],
but they have not been standardized. Horizontal aggregations
were proposed to create tabular data sets [13], as required
by statistical and machine learning techniques, combining
pivoting and aggregation in one function. For the most part
research work on preparing data sets for analytic purposes in
a relational database system remains scarce. Data quality is a
fundamental aspect in a data mining application. Referential
integrity quality metrics are proposed in [14], where users can
isolate tables and columns with invalid foreign keys. Such
referential problems must be solved in order to build a data
set without missing information.
Mining workflow logs, a different problem, has received at-
tention [18]; the basic idea is to discover patterns in workflow
process logs. To the best of our knowledge, there is scarce
research work dealing with migrating data preprocessing into
a database system to improve management and processing of
big data analytics workflows.
VI. CONCLUSIONS
We presented practical issues and discussed common solu-
tions to push data preprocessing into a database system to
improve management and processing of big data analytics
workflows. It is important to emphasize data preprocessing
is generally the most time consuming and error-prone task
in a big data analytics workflow. We identified specific data
preprocessing issues. Summarization generally has to be done
at different granularity levels and such levels are generally not
hierarchical. Rows are selected based on a time window, which
requires indexes on date columns. Row selection (filtering)
with complex predicates happens on many tables, making code
maintenance and query optimization difficult. In general, it is
necessary to create a “universe” data set to define left outer
joins. Model deployment requires importing models as SQL
queries or UDFs to deploy a model on large data sets. Based
on experience from real-life projects, we compared advantages
and disadvantages when running big data analytics workflows
entirely inside and outside a database system. Our general
observations are the following. big data analytics workflows
are easier to manage inside a database system. A big data
analytics workflow is generally faster to run inside a database
system assuming a data warehousing environment (i.e. data
originates from the database). From a practical perspective
workflows are easier to manage inside a database system
because users can exploit the extensive capabilities of the
database system (querying, recovery, security and concurrency
control) and there is less data redundancy. However, external
statistical tools may provide more flexibility than SQL and
more statistical techniques. From an efficiency (processing
time) perspective, transforming and scoring data sets and
computing models are faster to develop and run inside the
database system. Nevertheless, sampling represent a practical
solution to accelerate big data analytics workflows running
outside a database system.
Improving and optimizing the management and processing
of big data analytics workflows provides many opportunities
for future work. It is necessary to specify workflow models
which take into account processing with external data mining
tools. The data set tends to be the bottleneck in a big data
analytics workflow from both a programming and a processing
time perspective. Therefore, it is necessary to specify workflow
models which can serve as templates to prepare data sets. The
role of the statistical model in a workflow needs to be studied
in the context of the source tables that were used to build the
data set; very few organizations reach that stage.
REFERENCES
[1] J. Clear, D. Dunn, B. Harvey, M.L. Heytens, and P. Lohman. NonStop SQL/MX primitives for knowledge discovery. In ACM KDD Conference, pages 425–429, 1999.
[2] J. Cohen, B. Dolan, M. Dunlap, J. Hellerstein, and C. Welton. MAD skills: New analysis practices for big data. In Proc. VLDB Conference, pages 1481–1492, 2009.
[3] C. Cunningham, G. Graefe, and C.A. Galindo-Legaria. PIVOT and UNPIVOT: Optimization and execution strategies in an RDBMS. In Proc. VLDB Conference, pages 998–1009, 2004.
[4] R. Elmasri and S.B. Navathe. Fundamentals of Database Systems. Addison-Wesley, 4th edition, 2003.
[5] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11):27–34, November 1996.
[6] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, 2nd edition, 2006.
[7] T. Hastie, R. Tibshirani, and J.H. Friedman. The Elements of Statistical Learning. Springer, New York, 1st edition, 2001.
[8] A. Hinneburg, D. Habich, and W. Lehner. Combi-operator: database support for data mining applications. In Proc. VLDB Conference, pages 429–439, 2003.
[9] M. Jaedicke and B. Mitschang. On parallel processing of aggregate and scalar functions in object-relational DBMS. In ACM SIGMOD Conference, pages 379–389, 1998.
[10] T.M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.
[11] A. Netz, S. Chaudhuri, U. Fayyad, and J. Bernhardt. Integrating data mining with SQL databases: OLE DB for data mining. In Proc. IEEE ICDE Conference, pages 379–387, 2001.
[12] C. Ordonez. Statistical model computation with UDFs. IEEE Transactions on Knowledge and Data Engineering (TKDE), 22(12):1752–1765, 2010.
[13] C. Ordonez and Z. Chen. Horizontal aggregations in SQL to prepare data sets for data mining analysis. IEEE Transactions on Knowledge and Data Engineering (TKDE), 24(4):678–691, 2012.
[14] C. Ordonez and J. García-García. Referential integrity quality metrics. Decision Support Systems Journal, 44(2):495–508, 2008.
[15] C. Ordonez and J. García-García. Database systems research on data mining. In Proc. ACM SIGMOD Conference, pages 1253–1254, 2010.
[16] D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1999.
[17] E. Rahm and H.H. Do. Data cleaning: Problems and current approaches. IEEE Bulletin of the Technical Committee on Data Engineering, 23(4), 2000.
[18] W.M.P. van der Aalst, B.F. van Dongen, J. Herbst, L. Maruster, G. Schimm, and A.J.M.M. Weijters. Workflow mining: a survey of issues and approaches. Data & Knowledge Engineering, 47(2):237–267, 2003.

Database Integrated Analytics using R InitialExperiences wi

  • 1. Database Integrated Analytics using R: Initial Experiences with SQL-Server + R Josep Ll. Berral and Nicolas Poggi Barcelona Supercomputing Center (BSC) Universitat Politècnica de Catalunya (BarcelonaTech) Barcelona, Spain Abstract—Most data scientists use nowadays functional or semi-functional languages like SQL, Scala or R to treat data, obtained directly from databases. Such process requires to fetch data, process it, then store again, and such process tends to be done outside the DB, in often complex data-flows. Recently, database service providers have decided to integrate “R-as-a- Service” in their DB solutions. The analytics engine is called directly from the SQL query tree, and results are returned as part of the same query. Here we show a first taste of such technology by testing the portability of our ALOJA-ML analytics framework, coded in R, to Microsoft SQL-Server 2016, one of the SQL+R solutions released recently. In this work we discuss some data-flow schemes for porting a local DB + analytics engine architecture towards Big Data, focusing specially on the new DB Integrated Analytics approach, and commenting the first experiences in usability and performance obtained from such new services and capabilities. I. INTRODUCTION
  • 2. Current data mining methodologies, techniques and algo- rithms are based in heavy data browsing, slicing and process- ing. For data scientists, also users of analytics, the capability of defining the data to be retrieved and the operations to be applied over this data in an easy way is essential. This is the reason why functional languages like SQL, Scala or R are so popular in such fields as, although these languages allow high level programming, they free the user from programming the infrastructure for accessing and browsing data. The usual trend when processing data is to fetch the data from the source or storage (file system or relational database), bring it into a local environment (memory, distributed workers, ...), treat it, and then store back the results. In such schema functional language applications are used to retrieve and slice the data, while imperative language applications are used to process the data and manage the data-flow between systems. In most languages and frameworks, database connection pro- tocols like ODBC or JDBC are available to enhance this data-
  • 3. flow, allowing applications to directly retrieve data from DBs. And although most SQL-based DB services allow user-written procedures and functions, these do not include a high variety of primitive functions or operators. The arrival of the Big Data favored distributed frameworks like Apache Hadoop and Apache Spark, where the data is distributed “in the Cloud” and the data processing can also be distributed where the data is placed, then results are joined and aggregated. Such technologies have the advantage of distributed computing, but when the schema for accessing data and using it is still the same, just that the data distribution is transparent to the user. Still, the user is responsible of adapting any analytics to a Map-Reduce schema, and be responsible of the data infrastructure. Recently, companies like Microsoft, IBM or Cisco, providers of Analytics as a Service platforms, put special effort into complementing their solutions by adding script- ing mechanisms into their DB engines, allowing to embed analytics mechanisms into the same DB environment. All of them selected R [17] as a language and analytics engine, a
  • 4. free and open-source statistical oriented language and engine, embraced by the data mining community since long time ago. The current paradigm of “Fetch from DB, Process, Dump to DB” is shifted towards an “In-DB Processing” schema, so the operations to be done on the selected data are provided by the same DB procedures catalog. All the computation remains inside the DB service, so the daily user can proceed by simply querying the DB in a SQL style. New R-based procedures, built-in or user created by invoking R scripts and libraries, are executed as regular operations inside the query execution tree. The idea of such integration is that, not only this processing will be more usable by querying the DB, where the data is managed and distributed, but also will reduce the overhead of the data pipe-line, as everything will remain inside a single data framework. Further, for continuous data processing, analytics procedures and functions can be directly called from triggers when data is continuously introduced or modified. As a case of use of such approach, in this paper we present
  • 5. some experiences on porting the ALOJA framework [13] to the recently released SQL-Server 2016, incorporating this R- Service functionality. The ALOJA-ML framework is a col- lection of predictive analytics functions (machine learning and data mining), written in R, originally purposed for mod- eling and prediction High Performance Computing (HPC) benchmarking workloads as part of the ALOJA Project [16], deployed to be called after retrieving data from a MySQL database. Such project collects traces and profiling of Big Data framework technologies, and analyzing this data requires predictive analytics. These procedures are deployed in R, and communicating the framework with the database and R engine results in a complex architecture, with a fragile data-pipeline and susceptible to failures. In this work we present how we could adapt our current data-processing approach to a more Big-Data oriented archi- 2016 IEEE 16th International Conference on Data Mining Workshops
  • 6. 2375-9259/16 $31.00 © 2016 IEEE DOI 10.1109/ICDMW.2016.155 1 tecture, and how we tested using the ALOJA data-set [12], as an example and as a review from the user’s point of view. After testing the Microsoft version of SQL+R services [10], we saw that the major complication, far away of deploying the service, is to build the SQL wrapping procedures for the R scripts to be executed. When reusing or porting already existing code or R applications, the wrapper just has to source (R function for “import”) the original code and its libraries, and execute the corresponding functions (plus bridging the parameters and the return of this function). Current services are prepared to transform 2 dimensional R Data Frames into tables, as a result of such procedures. Aside of Microsoft services, we took a look into Cisco ParStream [6], displaying a very similar approach, differing on the way of instantiating the R scripts through R file calls instead of direct scripting.
  • 7. It remains for the future work to test and compare the per- formances among platforms, also to include some experiences with IBM PureData Services [7] and any other new platform providing such services. This article is structured as follows: Section II presents the current state-of-art and recent approaches used when processing data from databases. Section III explains the current and new data-flow paradigm, and required changes. Section IV shows some code porting examples from our native R scripts to Microsoft SQL Server. Section V provides comments and de- tails on some experiments on running the ALOJA framework in the new architecture. Finally, section VI summarizes this current work and presents the conclusions and future work. II. STATE OF THE ART Current efforts on systems processing Big Data are mostly focused on building and improving distributed systems. Plat- forms like Apache Hadoop [1] and Apache Spark [4], with all their “satellite” technologies, are on the rise on Big Data
  • 8. processing environments. Those platforms, although being originally designed towards Java or Python applications, the significant weight of the data mining community using R, Scala or using SQL interfaces, encouraged the platforms strate- gists over the years to include interfaces and methodologies for using such languages. Revolution Analytics published in 2011 RHadoop [9], a set of packages for R users to launch Hadoop tasks. Such packages included HDFS [14] and HBase [2] han- dlers, with Map-Reduce and data processing function libraries adapted for Hadoop. This way, R scripts could dispatch par - allelizable functions (e.g. R “apply” functions) to be executed in distributed worker computing machines. The Apache Spark platform, developed by the Berkeley’s AMPlab [11] and Databricks [5] and released in 2014, focuses on four main applied data science topics: graph processing, machine learning, data streaming, and relational algebraic queries (SQL). For these, Spark is divided in four big pack- ages: GraphX, MLlib, SparkStream, and SparkSQL. SparkSQL
  • 9. provides a library for treating data through SQL syntax or through relational algebraic functions. Also recently, a R scripting interface has been added to the initial Java, Python and Scala interfaces, through the SparkR package, providing the Spark based parallelism functions, Map-Reduce and HBase or Hive [3] handlers. This way, R users can connect to Spark deployments and process data frames in a Map-Reduce manner, the same way they could do with RHadoop or using other languages. Being able to move the processing towards the database side becomes a challenge, but allows to integrate analytics into the same data management environment, letting the same framework that receives and stores data to process it, in the way it is configured (local, distributed...). For this purpose, companies providing database platforms and services put effort in adding data processing engines as integrated components to their solutions. Microsoft recently acquired Revolution Analyt- ics and its R engine, re-branded as R-Server [8], and connected
  • 10. to the Microsoft SQL-Server 2016 release [10]. IBM also released their platform Pure Data Systems for Analytics [7], providing database services including the vanilla R engine from the Comprehensive R Archive Network [17]. Also Cisco recently acquired ParStream [6], a streaming database product incorporating user defined functions, programmable as shared object libraries in C++ or as external scripts in R. Here we describe our first approach to integrate our R-based analytics engine into a SQL+R platform, the Microsoft SQL- Server 2016, primarily looking at the user experience, and discussing bout cases of use where one architecture would be preferred over others. III. DATA-FLOW ARCHITECTURES Here we show three basic schemes of data processing, the local ad-hoc schema of pull-process-push data, the new distributed schemes for Hadoop and Spark, and the “In- DataBase” approach using the DB integrated analytics ser- vices. Point out that there is not an universal schema that works for each situation, and each one serves better or worse depending on the situation. As an example we put the
  • 11. case of the ALOJA-ML framework as an example of such architectures, how it currently operates, the problematic, and how it would be adapted to new approaches. A. Local Environments In architectures where the data and processing capacity is in the same location, having data-sets stored in local file systems or direct access DBs (local or remote databases where the user or the application can access to retrieve or put data). When an analytics application wants to process data, it can access the DB to fetch the required data, store locally and then pass to the analytics engine. Results are collected by the application and pushed again to the DB, if needed. This is a classical schema before having distributed systems with distributed databases, or for systems where the processing is not considered big enough to build a distributed computing- power environment, or when the process to be applied on data cannot be distributed. For systems where the analytics application is just a user
  • 12. of the data, this schema might be the appropriate one, as the application fetches the data it is granted to view and then does whatever it wants with it. It also fits applications whose computation does not require selecting big amounts of data: when the data required is a small fraction of the total Big Data, it is affordable to fetch that slice and process it locally. Likewise, libraries like snowfall [X], which let the user tune the parallelism of R functions, can be deployed locally up to the point where this distribution requires heavier mechanisms like Hadoop or Spark. There are mechanisms and protocols, like ODBC or JDBC (e.g. the RODBC library for R), that allow applications to access DBs directly and fetch data without passing through the file system, bringing it straight into the application memory space. This requires that the analytics application has been granted direct access to the DB and understands the returned format. Figure 1 shows
  • 13. both archetypical local schemes. Fig. 1. Schema of local execution approaches, passing data through the File System or ODBC mechanisms In the first case of Figure 1 we suppose an scenario where the engine has no direct access to DB, as everything that goes into analytics is triggered by the base application. Data is retrieved and pre-processed, then piped to the analytics engine through the file system (files, pipes, ...). This requires coordination between application and engine, in order to communicate the data properly. Such scenarios can happen when the DB, the application and the engine do not belong to the same organization, or when they belong to services from different providers. Also it can happen when security issues arise, as the analytics can be provided by a not-so- trusted provider, and data must be anonymized before exiting the DB. Also, if the DB and analytics engine are provided by the same service provider, there will be means to communicate the
  • 14. storage with the engine in an ODBC or other way, allowing to fetch the data, process, and then return the results to the DB. As an example, the Microsoft Azure-ML services [15] allow the connection of its machine learning components with the Azure storage service, also the inclusion of R code as part of user-defined components. In such service, the machine learning operations do not happen on the storage service, but data is pulled into the engine and then processed. At this moment, the ALOJA Project, consists of a web user interface (front-end and back-end), a MySQL database and the R engine for analytics, used this schema of data pipe-line. The web interface provided the user the requested data from the ALOJA database, and then displayed in the front-end. If analytics were required, the back-end dumped the data to be treated into the file system, and then started the R engine. Results were collected by the back-end and displayed (also stored in cache). As most of the information passed through user filters, this data pipe-lining was preferred over the direct
  • 15. connection of the scripts with the DB. Also this maintained the engine independent of the DB queries in constant development at the main framework. Next steps for the project are planned towards incorporating a version of the required analytics into the DB, so the back- end of the platform can query any of the provided analytics, coded as generic for any kind of input data, as a single SQL query. B. Distributed Environments An alternative architecture for the project would be to upload the data into a Distributed Database (DDB), thinking on expanding the ALOJA database towards Big Data (we still have Terabytes of data to be unpacked into the DB and to be analyzed at this time). The storage of the data could be done using Hive or HBase technologies, also the analytics could be adapted towards Hadoop or Spark. For Hadoop, the package RHadoop could handle the analytics, while the SparkSQL + SparkR packages could do the same for Spark. Processing
  • 16. the analytics could be done on a distributed system with worker nodes, as most of the analytics in our platform can be parallelized. Further, Hadoop and Spark have machine learning libraries (Mahout and MLlib) that could be used natively instead of some functionalities of our ALOJA-ML framework. Figure 2 shows the execution schema of this approach. Such an approach would concentrate all the data retrieval and processing in a distributed system, not necessarily near the application, but providing parallelism and agility for data browsing, slicing and processing. However, it requires a complex set-up of all the involved services and machines, and the adaptation of the analytics towards parallelism at a different level than when using snowfall or other packages. SparkR provides a Distributed Data-Frame structure with properties similar to regular R data-frames, but due to its distributed nature some classic operations are not available
  • 17. or must be performed in a different way (e.g. column binding, aggregates like "mean", "count", ...).
Fig. 2. Schema of the distributed execution approach, with Distributed Databases or File Systems.
This option would be chosen once the data is too large to be dealt with by a single system, considering that the DB managers do not provide means to distribute and retrieve the data. Also, as in the previous approach, the data pipe-line relies on calling the analytics engine to produce the queries and invoke the analytics (in a Map-Reduce manner) when they are parallelizable, or to collect the data locally for the non-parallelizable analytics. The price to be paid is the set-up of the platform and the proper adjustment of the analytics towards a Map-Reduce schema. C. Integrated Analytics The second alternative to the local analytics schema is to incorporate those functions into the DB services. At this point the DB can offer Analytics as a Service, as users can query for analytics directly from the DB in a SQL manner, and data
  • 18. will be directly retrieved, processed and stored by the same framework, sparing the user from planning a full data pipe-line. As the presented services and products admit R scripts and programs directly, no adaptation of the R code is required; the effort goes into coding the wrapping procedures to be called from a SQL interface. Figure 3 shows the basic data-flow for this option. Fig. 3. Schema of the "In-DB" execution approach. The advantages of such an approach are that 1) analytics, data and data management are integrated in the same software package, and the only deployment and configuration is that of the database; 2) the management of distributed data in Big Data deployments will be provided by the same DB software (provided this is implemented as part of the service), and the user does not need to implement Map-Reduce functions, as data can be aggregated in the same SQL query; 3) simple DB users can invoke the analytics through a SQL query, and analytics can be programmed generically to be applied to any required
  • 19. input SQL query result; 4) DBs providing triggers can produce Continuous Analytic Queries each time data is introduced or modified. This option still has issues to be managed, such as the capacity for optimization and parallelism of the embedded scripts. In the case of the Microsoft R Service, any R code can be inserted inside a procedure, with no apparent optimization applied, as the script is sent directly to an external R sub-service managed by the principal DB service. This R sub-service promises multi-threading in the Enterprise edition; as the classic R engine is single-threaded, multi-threading could be applied to vectorization functions like "apply" without having to load the previously mentioned snowfall (which remains loadable, since the script runs on a compatible R engine). Also, comparing this approach to the distributed computing system, if the service supports "partitioning" of data according to given columns and values, SQL queries could be distributed across a cluster of machines (as distribution, not replication) and aggregated afterwards. IV. ADAPTING AND EMBEDDING CODE
  • 20. According to the SQL-Server published documentation and API, the principal way to introduce external user-defined scripts is to wrap the R code inside a Procedure. The SQL- Server procedures admit the script, that like any “Rscript” can source external R files and load libraries (previously copied into the corresponding R library path set up by the Microsoft R Service), also admit the parameters to be bridged towards the script, and also admit SQL queries to be executed previous to start the script for filling input data-frames. The procedure also defines the return values, in the form of data-frames as tables, values or a tuple of values. When the wrapping procedure is called, input SQL queries are executed and passed as input data-frame parameters, direct parameters are also passed to the script, then the script is executed in the R engine as it would do a “Rscript” or a script dumped into the R command line interface. The variables mapped as outputs are returned from the procedure into the invoking SQL query.
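  As a side note on the library deployment mentioned above, the short R sketch below is ours and not part of the SQL-Server documentation: run from an R console attached to the R Service, it is one way to check which library folders the embedded engine searches and to install the packages used later in this section if they are missing. The exact paths and the permissions needed to install into them depend on the concrete installation.

# Minimal sketch, assuming it is run inside the R engine used by the SQL-Server R Service.
.libPaths()   # library folders searched by the embedded engine
deps <- c("digest", "base64enc", "snowfall")
missing <- setdiff(deps, rownames(installed.packages()))
if (length(missing) > 0) {
  # install into the first library folder so that sourced scripts can load them
  install.packages(missing, lib = .libPaths()[1], repos = "https://cran.r-project.org")
}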
  • 21. Figure 4 shows an example of calling the ALOJA-ML functions in charge of learning a linear model from the ALOJA data-set, read from a file while indicating the input and output variables, and of calling a function to predict a data-set using the previously created model. The training process creates a model and its hash ID; the prediction process then applies the model to the whole testing data-set. In the current ALOJA data pipe-line, "ds" should be retrieved from the DB (to a file to be read, or directly into a variable if RODBC is used).

source("functions.r");
library(digest);

## Training Process
model <- aloja_linreg(ds = read.table("aloja6.csv"),
                      vin = c("maps","iofilebuf"), vout = "exe_time");
id_hash <- digest(x = model, algo = "md5"); # ID for storing the model in the DB

## Prediction Example
predictions <- aloja_predict_dataset(learned_model = model,
                                     ds = read.table("aloja6test.csv"));
  • 22. Fig. 4. Example of modeling and prediction using the ALOJA-ML libraries. Loading the ALOJA-ML functions allows the code to execute "aloja_linreg" to fit a linear regression, and "aloja_predict_dataset" to process a new data-set using a previously trained model. Then the results ("model", "id_hash" and "predictions") should be reintroduced into the DB, again using RODBC, or by writing the results into a file and the model into a serialized R object file. Figures 5 and 6 show how this R code is wrapped as a procedure, and examples of how these procedures are invoked. Using this schema, in the modeling procedure "ds" would be passed to the procedure as a SQL query, "vin" and "vout" would be bridged parameters, and "model" and "id_hash" would be returned as two values (a serialized blob/string and a string) that can be saved into a models table. The prediction procedure would admit an "id_hash" for retrieving the model (using a SQL query inside the procedure) and would return a data-frame/table with the row IDs and the predictions. We observed that, when preparing the embedded script, all
  • 23. sources, libraries and file paths must be prepared like a Rscript to be executed from a command line. The environment for the script must be set-up, as it will be executed each time in a new R session. In the usual work-flow on ALOJA-ML tools, models (R serialized objects) are stored in the file system, and then uploaded in the database as binary blobs. Working directly from the DB server allow to directly encode serialized objects into available DB formats. Although SQL-Server includes a “large binary” data format, we found some problems when returning binary information from procedures (syntax not allowing the return of such data type in tuples), thus serialized objects can be converted to text-formats like base64 to be stored as a “variable size character array”. V. EXPERIMENTS AND EXPERIENCES A. Usability and Performance For testing the approach we used the SQL-Server 2016 Basic version, with the R Service installed and with Windows Server 2012, from a default image available at the Azure
  • 24. repositories. Being the basic service, and not the enterprise one, we assumed that R Services would not provide multi-threading improvements, so additional packages for such functions were installed (snowfall). The data from the ALOJA data-set is imported through the CSV importing tools in the SQL-Server Manager framework, the "Visual-Studio"-like integrated development environment (IDE) for managing the services. Data importing displayed some problems concerning data-type transformation, as some numeric columns could not be properly imported due to precision and very large values, and had to be treated as varchar (thus not treatable as numbers). After making sure that the CSV data was properly imported, the IDE allowed us to perform simple queries (SELECT, INSERT, UPDATE). At this point, a table for storing key-value entries, containing the models with their specific ID hash as key, is created. Procedures wrapping the basic available functions of ALOJA-ML are created, as specified in the previous section IV. After executing some example calls, the system works as expected.
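  Since this Basic edition leaves the embedded engine single-threaded, any parallelism has to come from ordinary R packages such as the snowfall package installed above. The following sketch only illustrates the idea and is not the actual ALOJA-ML code: the per-row check_outlier() helper is hypothetical, and ds, preds, k and sigma are assumed to exist already in the R session.

library(snowfall)

# Hypothetical per-row test standing in for the ALOJA-ML outlier criterion
# (the real one also checks the support from similar entries in the data-set).
check_outlier <- function(i, ds, preds, k, sigma) {
  abs(ds$exe_time[i] - preds[i]) > k * sigma
}

sfInit(parallel = TRUE, cpus = 8)   # e.g. one worker per virtual core of an HDI-A8 instance
flags <- sfSapply(seq_len(nrow(ds)), check_outlier,
                  ds = ds, preds = preds, k = k, sigma = sigma)
sfStop()

  The same pattern could be embedded in the @script body of a wrapping procedure, which is the kind of set-up used for the timings reported in the next section.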
  • 25. During the process of creating the wrapping procedures we found some issues, probably unreported bugs or internal system limitations: a procedure can return a large binary type (a large blob in MySQL and similar solutions) and can return tuples of diverse data types, but it crashed with an internal error when trying to return a tuple of a varchar and a large binary. A workaround was found by converting the serialized model object (binary type) into a base64-encoded string (varchar type), to be stored with its ID hash key. As no information about this issue was found in the documentation at the time of registering these experiences, we expect that such issues will be solved by the development team in the future. We initially did some tests using the modeling and prediction ALOJA-ML functions over the data and, comparing times with a local "vanilla" R setup, performance is almost identical. This indicates that in this "basic" version the R Server (formerly the Revolution Analytics R engine) is essentially unchanged at this point. Another test was to run the outlier classification function of our framework. The function, explained in detail
  • 26. in its corresponding work [13], compares the original output variables with their predictions; if the difference is k times greater than the expected standard deviation plus the modeling error, and the value does not have enough support from similar values in the rest of the data-set, such data is considered an outlier. This implies constantly reading the data table for prediction and for comparisons between entries. The performance results were similar to the ones on a local execution environment, measuring only the time spent in the function.

%% Creation of the Training Procedure, wrapping the R call
CREATE PROCEDURE dbo.MLPredictTrain @inquery nvarchar(max),
  @varin nvarchar(max), @varout nvarchar(max)
AS BEGIN
  EXECUTE sp_execute_external_script
    @language = N'R',
    @script = N'
      source("functions.r");
      library(digest);
      library(base64enc);
      model <- aloja_linreg(ds = InputDataSet,
                            vin = unlist(strsplit(vin,",")), vout = vout);
      serial <- as.raw(serialize(model, NULL));
      OutputDataSet <- data.frame(model = base64encode(serial),
                                  id_hash = digest(serial, algo = "md5"));
    ',
    @input_data_1 = @inquery,
    @input_data_1_name = N'InputDataSet',
    @output_data_1_name = N'OutputDataSet',
    @params = N'@vin nvarchar(max), @vout nvarchar(max)',
    @vin = @varin, @vout = @varout
  WITH RESULT SETS (("model" nvarchar(max), "id_hash" nvarchar(50)));
END

%% Example of creating a model and storing it into the DB
INSERT INTO aloja.dbo.trained_models (model, id_hash)
EXEC dbo.MLPredictTrain
  @inquery = 'SELECT exe_time, maps, iofilebuf FROM aloja.dbo.aloja6',
  @varin = 'maps,iofilebuf', @varout = 'exe_time'

Fig. 5. Version of the modeling call for the ALOJA-ML functions in a SQL-Server procedure. The procedure generates the data-set for "aloja_linreg" from a parametrized query and bridges the rest of the parameters into the script. It also indicates the format of the output, being a value, a tuple or a table (data frame).

%% Creation of the Predicting Procedure, wrapping the R call
CREATE PROCEDURE dbo.MLPredict @inquery nvarchar(max),
  @id_hash nvarchar(max)
AS BEGIN
  DECLARE @modelt nvarchar(max) = (SELECT TOP 1 model
    FROM aloja.dbo.trained_models WHERE id_hash = @id_hash);
  EXECUTE sp_execute_external_script
    @language = N'R',
    @script = N'
      source("functions.r");
      library(base64enc);
      results <- aloja_predict_dataset(
        learned_model = unserialize(as.raw(base64decode(model))),
        ds = InputDataSet);
      OutputDataSet <- data.frame(results);
    ',
    @input_data_1 = @inquery,
    @input_data_1_name = N'InputDataSet',
    @output_data_1_name = N'OutputDataSet',
    @params = N'@model nvarchar(max)',
    @model = @modelt;
END

%% Example of predicting a data-set from a SQL query with a previously trained model in the DB
EXEC aloja.dbo.MLPredict
  @inquery = 'SELECT exe_time, maps, iofilebuf FROM aloja.dbo.aloja6test',
  @id_hash = 'aa0279e9d32a2858ade992ab1de8f82e';

Fig. 6. Version of the prediction call for the ALOJA-ML functions in a SQL-Server procedure. Like the training procedure in Figure 5, it primarily retrieves the data to be processed from a SQL query and passes it with the rest of the parameters into the script. Here the result is directly a table (data frame).
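  For completeness, the sketch below shows how a client-side R session could invoke the procedure of Figure 6 through RODBC, as an alternative to the file-based pipe-line described earlier. The connection string is a placeholder that must be adapted to the actual server, and we assume the procedures and tables above have already been created.

library(RODBC)

# Placeholder connection string: adjust driver, server, database and credentials.
ch <- odbcDriverConnect("Driver={SQL Server};Server=myserver;Database=aloja;Trusted_Connection=yes")

# The in-DB analytics are consumed as a single SQL statement;
# the scored rows come back as a regular R data frame.
preds <- sqlQuery(ch, "EXEC aloja.dbo.MLPredict
  @inquery = 'SELECT exe_time, maps, iofilebuf FROM aloja.dbo.aloja6test',
  @id_hash = 'aa0279e9d32a2858ade992ab1de8f82e';")

odbcClose(ch)

  This keeps the client free of any modeling code: model retrieval and prediction happen entirely inside the DB.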
  • 29. In a HDI-A2 instance (2 virtual cores, only 1 used, 3.5GB memory), it took 1h:9m:56s to process the 33147 rows, selecting just 3 features. Then, to improve performance given the limitations of the single-threaded R Server, we loaded snowfall and invoked it from within the ALOJA-ML "outlier_dataset" function on a HDI-A8 instance (8 virtual cores, all used, 14GB memory). The data-set was processed in 11m:4s, barely 1/7 of the previous time, even considering the overhead of sharing data among the R processes created by snowfall, demonstrating that, despite not being a multi-threaded set-up, it is possible to scale R procedures using the traditional resources available in R. B. Discussion One of the concerns with the usage of such a service is, despite and because of the capability of multi-processing using built-in or loaded libraries, the management of the pool of R processes. R is not just a scripting language to be embedded in a procedure; it is a high-level language that allows anything from creating system calls to parallelizing work among networked
  • 30. working nodes. Given the complexity that a user-created R function can reach, in cases where such a procedure is heavily requested the R server should be able to be detached from the SQL-Server and run in dedicated HPC deployments. In the same way that snowfall can be deployed for multi-threading (and also for cluster computing), clever hacks can be created by loading RHadoop or SparkR inside a procedure, connecting the script with a distributed processing system. As the SQL-Server bridges tables and query results as R data frames, such data frames can be converted into the distributed structures handled by Hadoop or Spark (e.g. Resilient Distributed Datasets or Distributed Data Frames), uploaded to HDFS, processed, and then returned to the database. This could lead to a new architecture where SQL-Server (or equivalent solutions) connects to distributed processing environments acting as slave HPC workers for the database. A further improvement would be for the DB server itself, instead of producing plain input tables/data frames, to return Distributed Data Frames directly, with the database distributed across working nodes (by partitioning, not replication). All in all, the fact
  • 31. that the embedded R code is directly passed to a nearly- independent R engine allows to do whatever a data scientist can do with a typical R session. VI. SUMMARY AND CONCLUSIONS The incorporation of R, the semi-functional programming statistical language, as an embedded analytics service into databases will suppose an improvement on the ease and usability of analytics over any kind of data, from regular to big amounts (Big Data). The capability of data scientists to introduce their analytics functions as a procedure in databases, avoiding complex data-flows from the DB to analytics engines, allow users and experts a quick tool for treating data in-situ and continuously. This study discussed some different architectures for data processing, involving fetching data from DBs, distributing data and processing power, and embedding the data process into the DB. All of this using the ALOJA-ML framework as reference, a framework written in R dedicated to model, predict and
  • 32. classify data from Hadoop executions, stored as the ALOJA data-set. The examples and cases of use shown correspond to the port of the current ALOJA architecture towards SQL-Server 2016, integrating R Services. After testing the analytics functions ported into the SQL database, we observed that the major effort of this porting lies in the wrapping SQL structures for incorporating the R calls into the DB, without modifying the original R code. As performance is similar to that of standalone R distributions, the advantages come from the retrieval and storage of the input data. In future work we plan to test the system more in depth and to compare different SQL+R solutions, as companies offering DB products have started putting effort into integrating R engines in their DB platforms. As this study focused on an initial hands-on with this new technology, future studies will focus more on comparing performance, also against other architectures for processing Big Data. ACKNOWLEDGMENTS
  • 33. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 639595).
REFERENCES
[1] Apache Hadoop. http://hadoop.apache.org (Aug 2016).
[2] Apache HBase. https://hbase.apache.org/ (Aug 2016).
[3] Apache Hive. https://hive.apache.org/ (Aug 2016).
[4] Apache Spark. https://spark.apache.org/ (Aug 2016).
[5] Databricks inc. https://databricks.com/ (Aug 2016).
[6] ParStream. Cisco corporation. http://www.cisco.com/c/en/us/products/analytics-automation-software/parstream/index.html (Aug 2016).
[7] PureData Systems for Analytics. IBM corporation. https://www-01.ibm.com/software/data/puredata/analytics/ (Aug 2016).
[8] R-Server. Microsoft corporation. https://www.microsoft.com/en-us/cloud-platform/r-server (Aug 2016).
[9] RHadoop. Revolution Analytics. https://github.com/RevolutionAnalytics/RHadoop/wiki (Aug 2016).
[10] SQL-Server 2016. Microsoft corporation. https://www.microsoft.com/en-us/cloud-platform/sql-server (Aug 2016).
[11] UC Berkeley, AMPlab. https://amplab.cs.berkeley.edu/
  • 34. (Aug 2016).
[12] Barcelona Supercomputing Center. ALOJA home page. http://aloja.bsc.es/ (Aug 2016).
[13] J. L. Berral, N. Poggi, D. Carrera, A. Call, R. Reinauer, and D. Green. ALOJA-ML: A framework for automating characterization and knowledge discovery in Hadoop deployments. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015, pages 1701–1710, 2015.
[14] D. Borthakur. The Hadoop Distributed File System: Architecture and Design. http://hadoop.apache.org/docs/r0.18.0/hdfs_design.pdf. The Apache Software Foundation, 2007.
[15] Microsoft Corporation. Azure 4 Research. http://research.microsoft.com/en-us/projects/azure/default.aspx (Jan 2016).
[16] N. Poggi, J. L. Berral, D. Carrera, A. Call, F. Gagliardi, R. Reinauer, N. Vujic, D. Green, and J. A. Blakeley. From performance profiling to predictive analytics while evaluating Hadoop cost-efficiency in ALOJA. In 2015 IEEE International Conference on Big Data, Big Data
  • 35. 2015, Santa Clara, CA, USA, October 29 - November 1, 2015, pages 1220–1229, 2015.
[17] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2014.
Managing Big Data Analytics Workflows with a Database System
Carlos Ordonez (University of Houston, USA) and Javier García-García (UNAM, Mexico)
Abstract—A big data analytics workflow is long and complex, with many programs, tools and scripts interacting together. In general, in modern organizations there is a significant amount of big data analytics processing performed outside a database system, which creates many issues in managing and processing big data analytics workflows. In general, data preprocessing is the most time-consuming task in a big data analytics workflow. In this work, we defend the idea of preprocessing, computing models and scoring data sets inside a database system. In addition, we discuss recommendations and experiences to improve big data
  • 36. analytics workflows by pushing data preprocessing (i.e. data cleaning, aggregation and column transformation) into a database system. We present a discussion of practical issues and common solutions when transforming and preparing data sets to improve big data analytics workflows. As a case study validation, based on experience from real-life big data analytics projects, we compare pros and cons between running big data analytics workflows inside and outside the database system. We highlight which tasks in a big data analytics workflow are easier to manage and faster when processed by the database system, compared to external processing. I. INTRODUCTION In a modern organization transaction processing [4] and data warehousing [6] are managed by database systems. On the other hand, big data analytics projects are a different story. Despite existing data mining [5], [6] functionality offered by most database systems many statistical tasks are generally performed outside the database system [7], [6], [15]. This is due to the existence of sophisticated statistical tools and libraries, the lack of expertise of statistical analysts to write correct and efficient SQL queries, a somewhat limited set of
  • 37. statistical algorithms and techniques in the database system (compared to statistical packages) and abundance of legacy code (generally well tuned and difficult to rewrite in a different language). In such scenario, users exploit the database system just to extract data with ad-hoc SQL queries. Once large tables are exported they are summarized and further transformed depending on the task at hand, but outside the database system. Finally, when the data set has the desired variables (features), statistical models are computed, interpreted and tuned. In general, preprocessing data for analysis is the most time consuming task [5], [16], [17] because it requires cleaning, joining, aggregating and transforming files, tables to obtain “analytic” data sets appropriate for cube or machine learning processing. Unfortunately, in such environments manipulating data sets outside the database system creates many data man- agement issues: data sets must be recreated and re-exported every time there is a change, models need to be imported back into the database system and then deployed, different
  • 38. users may have inconsistent versions of the same data set on their own computers, and security is compromised. Therefore, we defend the thesis that it is better to transform data sets and compute statistical models inside the database system. We provide evidence that such an approach improves the management and processing of big data analytics workflows. We argue that database system-centric big data analytics workflows are a promising processing approach for large organizations. The experiences reported in this work are based on analytics projects where the first author participated as a developer and consultant in a major DBMS company. Based on experience from big data analytics projects, we have become aware that statistical analysts cannot write correct and efficient SQL code to extract data from the database system or to transform their data sets for the big data analytics task at hand. To overcome such limitations, statistical analysts use big data analytics tools, statistical software and data mining packages to manipulate and transform their data sets. Consequently, users
  • 39. end up creating a complicated collection of programs mixing data transformation and statistical modeling tasks together. In a typical big data analytics workflow most of the development (programming) effort is spent on transforming the data set: this is the main aspect studied in this article. The data set represents the core element in a big data analytics workflow: all data transformation and modeling tasks must work on the entire data set. We present evidence running workflows entirely inside a database system improves workflow management and accelerates workflow processing. The article is organized as follows. Section II provides the specification of the most common big data analytics workflows. Section III discusses practical issues and solutions when preparing and transforming data sets inside a database system. We contrast processing workflows inside and outside the database system. Section IV compares advantages and time performance of processing big data analytics workflows inside the database system and outside exploiting with external
  • 40. data mining tools. Such comparisons are based on experience from real-life data mining projects. Section V discusses related work. Section VI presents conclusions and research issues for future work. II. BIG DATA ANALYTICS WORKFLOWS We consider big data analytics workflows in which Task i must be finished before Task i + 1. Processing is mostly sequential from task to task, but it is feasible to have cycles from Task i back to Task 1 because data mining is an iterative process. We distinguish two complementary big data analytics workflows: (1) computing statistical models; (2) scoring data sets based on a model. In general, statistical models are computed on a new data set, whereas scoring takes place after the model is well understood and an acceptable data mining model
  • 41. has been produced. A typical big data analytics workflow to compute a model consists of the following major tasks (this basic workflow can vary depending on the users' knowledge and focus):
1) Data preprocessing: building data sets for analysis (selecting records, denormalization and aggregation)
2) Exploratory analysis (descriptive statistics, OLAP)
3) Computing statistical models
4) Tuning models: add or modify data set attributes, feature/variable selection, train/test models
5) Create scoring scripts based on the selected features and the best model
This modeling workflow tends to be linear (Task i executed before Task i+1), but in general it has cycles. This is because there are many iterations of building data sets, analyzing variable behavior and computing models before an acceptable one is obtained. Initially the bottleneck of the workflow is data preprocessing. Later, when the data mining project evolves, most workflow processing is testing and tuning the desired
  • 42. model. In general, the database system performs some denormalization and aggregation, but it is common to perform most transformations outside the database system. When features need to be added or modified, queries need to be recomputed, going back to Task 1. Therefore, only a part of the first task of the workflow is computed inside the database system; the remaining stages of the workflow are computed with external data mining tools. The second workflow we consider is actually deploying a model on arriving (new) data. This task is called scoring [7], [12]. Since the best features and the best model are already known, this is generally processed inside the database system. A scoring workflow is linear and non-iterative: processing starts with Task 1, Task i is executed before Task i + 1, and processing ends with the last task. In a scoring workflow the tasks are as follows:
1) Data preprocessing: building the data set for scoring (select records, compute best features).
  • 43. 2) Deploy model on test data set (e.g predict class or target value), store model output per record as a new table or view in the database system. 3) Evaluate queries and produce summary reports on scored data sets joined with source tables. Explain models by tracing data back to source tables in the database system. Building the data set tends to be most important task in both workflows since it requires gathering data from many source tables. In this work, we defend the idea of preprocessing big data inside a database system. This approach makes sense only when the raw data originates from a data warehouse. That is, we do not tackle the problem of migrating processing of external data sets into the database system. III. TASKS TO PREPARE DATA SETS We discuss issues and recommendations to improve the data processing stage in a big data analytics workflow. Our main motivation is to bypass the use of external tools to perform data preprocessing, using the SQL language instead.
  • 44. A. Reasons for Processing big data analytics workflows Out- side the database system In general, the main reason is of a practical nature: users do not want to translate existing code working on external tools. Commonly such code has existed for a long time (legacy programs), it is extensive (there exist many programs) and it has been debugged and tuned. Therefore, users are reluctant to rewrite it in a different language, given associated risks. A second complaint is that, in general, the database system provides elementary statistical functionality, compared to sophisticated statistical packages. Nevertheless, such gap has been shrinking over the years. As explained before, in a data mining environment most user time is spent on preparing data sets for analytic purposes. We discuss some of the most important database issues when tables are manipulated and transformed to prepare a data set for data mining or statistical analysis. B. Main Data Preprocessing Tasks We consider three major tasks:
  • 45. 1) selecting records for analysis (filtering) 2) denormalization (including math transformation) 3) aggregation Selecting Records for Analysis: Selection of rows can be done at several stages, in different tables. Such filtering is done to discard outliers, to discard records with a significant missing information content (including referential integrity [14]), to discard records whose potential contribution to the model provides no insight or sets of records whose characteristics deserve separate analysis. It is well known that pushing selection is the basic strategy to accelerate SPJ (select-project- join) queries, but it is not straightforward to apply into multiple queries. A common solution we have used is to perform as much filtering as possible on one data set. This makes code maintenance easier and the query optimizer is able to exploit filtering predicates as much as possible. In general, for a data mining task it is necessary to select a set of records from one of the largest tables in the database
  • 46. based on a date range. In general, this selection requires a scan of the entire table, which is slow. When there is a secondary index based on the date it is more efficient to select rows, but such an index is not always available. The basic issue is that such a transaction table is much larger than the other tables in the database. For instance, this time window defines a set of active customers or bank accounts that have recent activity. Common solutions to this problem include creating materialized views (avoiding join recomputation on large tables) and lean tables with the primary keys of the object being analyzed (record, product, etc.) to act as a filter in further data preparation tasks. Denormalization: In general, it is necessary to gather data from many tables and store the data elements in one place. It is well known that on-line transaction processing (OLTP) database systems update normalized tables. Normalization makes transaction processing faster and ACID [4] semantics easier to ensure. Queries that retrieve a few records
  • 47. from normalized tables are relatively fast. On the other hand, analysis on the database requires precisely the opposite: a large set of records is required and such records gather information from many tables. Such processing typically involves complex queries with joins and aggregations. Therefore, normalization works against efficiently building data sets for analytic purposes. One solution is to keep a few key denormalized tables from which specialized tables can be built. In general, such tables cannot be dynamically maintained because they involve join computation with large tables. Therefore, they are periodically recreated as a batch process. The query shown below builds a denormalized table from which several aggregations can be computed (e.g. similar to cube queries).

SELECT customer_id
  ,customer_name
  ,product.product_id
  ,product_name
  ,department.department_id
  ,department_name
FROM sales
JOIN product ON sales.product_id = product.product_id
JOIN department ON product.department_id = department.department_id;

For analytic purposes it is always best to use as much data as possible. There are strong reasons for this: statistical models are more reliable, and it is easier to deal with missing information and skewed distributions and to discover outliers, when there is a large data set at hand. In a large database, joining tables coming from a normalized database with tables used in the past for analytic purposes may involve joins with records whose foreign keys are not found in some table. That is, natural joins may discard potentially useful records. The net effect of this issue is that the resulting data set does not include all potential objects (e.g. records, products). The solution is to define a universe data set containing all objects, gathered with a union from all tables, and then use such a table as the fundamental table to perform outer joins. For
  • 49. simplicity and elegance, left outer joins are preferred. Left outer joins are then propagated everywhere in data preparation, and completeness of records is guaranteed. In general such left outer joins have a "star" form in the joining conditions, where the primary key of the master table is left joined with the primary keys of the other tables, instead of joining them with chained conditions (FK of table T1 joined with PK of table T2, FK of table T2 joined with PK of T3, and so on). The query below computes a global data set built from individual data sets; notice that the data sets may or may not overlap each other.

SELECT record_id
  ,T1.A1
  ,T2.A2
  ..
  ,Tk.Ak
FROM T_0
LEFT JOIN T1 ON T_0.record_id = T1.record_id
LEFT JOIN T2 ON T_0.record_id = T2.record_id
..
LEFT JOIN Tk ON T_0.record_id = Tk.record_id;

Aggregation: Unfortunately, most data mining tasks require
  • 50. dimensions (variables) that are not readily available from the database. Such dimensions typically require computing aggregations at several granularity levels. This is because most columns required by statistical or machine learning techniques are measures (or metrics), which translate into sums or counts computed with SQL. Unfortunately, granularity levels are not hierarchical (as in cubes or OLAP [4]), making the use of separate summary tables necessary (e.g. summarization by product or by customer in a retail database). A straightforward optimization is to compute as many dimensions as possible in the same statement, exploiting the same group-by clause. In general, for a statistical analyst it is best to create as many variables (dimensions) as possible in order to isolate those that can help build a more accurate model. Summarization then tends to create tables with hundreds of columns, which makes query processing slower. However, most state-of-the-art statistical and machine learning techniques are designed to perform variable (feature) selection [7], [10], and many of
  • 51. those columns end up being discarded. A typical query to derive dimensions from a transaction table is as follows:

SELECT customer_id
  ,count(*) AS cntItems
  ,sum(salesAmt) AS totalSales
  ,sum(case when salesAmt < 0 then 1 end) AS cntReturns
FROM sales
GROUP BY customer_id;

Transaction tables generally have two or even more levels of detail, sharing some columns in their primary key. The typical example is a store transaction table with individual items plus the total count of items and the total amount paid. This means that many times it is not possible to perform a statistical analysis from only one table, since there may be unique pieces of information at each level. Therefore, such large tables need to be joined with each other and then aggregated at the appropriate granularity level, depending on the data
  • 52. mining task at hand. In general, such queries are optimized by indexing both tables on their common columns so that hash-joins can be used. Combining Denormalization and Aggregation: Multiple pri- mary keys: A last aspect is considering aggregation and denormalization interacting together when there are multiple primary keys. In a large database different subsets of tables have different primary keys. In other words, such tables are not compatible with each other to perform further aggregation. The key issue is that at some point large tables with different primary keys must be joined and summarized. Join operations will be slow because indexing involves foreign keys with large cardinalities. Two solutions are common: creating a secondary index on the alternative primary key of the largest table, or creating a denormalized table having both primary keys in order to enable fast join processing. For instance, consider a data mining project in a bank that requires analysis by customer id, but also account id. One customer may have multiple accounts. An account may have
  • 53. multiple account holders. Joining and manipulating such tables is challenging given their sizes. C. Workflow Processing Dependent SQL statements: A data transformation script is a long sequence of SELECT statements. Their dependencies are complex, although there exists a partial order defined by the order in which temporary tables and data sets for analysis are created. To debug SQL code it is a bad idea to create a single query with multiple query blocks. In other words, such SQL statements are not amenable to the query optimizer because they are separate, unless it can keep track of historic usage patterns of queries. A common solution is to create intermediate tables that can be shared by several statements. Those intermediate tables commonly have columns that can later be aggregated at the appropriate granularity levels. Computer resource usage: This aspect includes both disk and CPU usage, with the second one being a more valuable resource. This problem gets compounded by the fact that most data mining tasks work on the entire data set or large
  • 54. subsets from it. In an active database environment running data preparation tasks during peak usage hours can degrade performance since, generally speaking, large tables are read and large tables are created. Therefore, it is necessary to use workload management tools to optimize queries from several users together. In general, the solution is to give data preparation tasks a lower priority than the priority for queries from interactive users. On a longer term strategy, it is best to organize data mining projects around common data sets, but such goal is difficult to reach given the mathematical nature of analysis and the ever changing nature of variables (dimensions) in the data sets. Comparing views and temporary tables: Views provide limited control on storage and indexing. It may be better to create temporary tables, especially when there are many primary keys used in summarization. Nevertheless, disk space usage grows fast and such tables/views need to be refreshed when new records are inserted or new variables (dimensions) are created.
  • 55. Scoring: Even though many models are built outside the database system with statistical packages and data mining tools, in the end the model must be applied inside the database system [6]. When volumes of data are not large it is feasible to perform model deployment outside: exporting data sets, applying the model and building reports can be done in no more than a few minutes. However, as data volume increases exporting data from the database system becomes a bottleneck. This problem gets compounded with results interpretation when it is necessary to relate statistical numbers back to the original tables in the database. Therefore, it is common to build models outside, frequently based on samples, and then once an acceptable model is obtained, then it is imported back into the database system. Nowadays, model deployment basically happens in two ways: using SQL queries if the mathematical computations are relatively simple or with UDFs [2], [12] if the computations are more sophisticated. In most cases, such scoring process can work in a single table scan, providing
  • 56. good performance. IV. COMPARING PROCESSING OF WORKFLOWS INSIDE AND OUTSIDE A DATABASE SYSTEM In this section we present a qualitative and quantitative comparison between running big data analytics workflows inside and outside the database system. This discussion is a summary of representative successful projects. We first discuss a typical database environment. Second, we present a summary of the data mining projects presenting their practical application and the statistical and data mining techniques used. Third, we discuss advantages and accomplishments for each workflow running inside the database system, as well as the main objections or concerns against such approach (i.e. migrating external transformation code into the database system). Fourth, we present time measurements taken from actual projects at each organization, running big data analytics workflows completely inside and partially outside the database system.
  • 57. A. Data Warehousing Environment The environment was a data warehouse, where several databases were already integrated into a large enterprise-wide database. The database server was surrounded by specialized servers performing OLAP and statistical analysis. One of those servers was a statistical server with a fast network connection to the database server. First of all, an entire set of statistical language programs was translated into SQL using the Teradata data mining program, the translator tool and customized SQL code. Second, in every case the data sets were verified to have the same contents in the statistical language and in SQL. In most cases the numeric output from the statistical and machine learning models was the same, but sometimes there were slight numeric differences, given variations in algorithmic improvements and advanced parameters (e.g. epsilon for convergence, step-wise regression
  • 58. procedures, pruning method in decision tree and so on). B. Organizations: statistical models and business application We now give a brief discussion about the organizations where the statistical code migration took place. We also discuss the specific type of data mining techniques used in each case. Due to privacy concerns we omit discussion of specific information about each organization, their databases and the hardware configuration of their database system servers. The big data analytics workflows were executed on the organizations database servers, concurrently with other users (analysts, managers, DBAs, and so on). We now describe the computers processing the workflow in more detail. The database system server was, in general, a parallel multiprocessor computer with a large number of CPUs, ample memory per CPU and several terabytes of parallel disk storage in high performance RAID configurations. On the other hand, the statistical server was generally a smaller computer with less than 500 GB of disk space with ample
  • 59. memory space. Statistical and data mining analysis inside the database system was performed only with SQL. In general, a workstation connected to each server with appropriate client utilities. The connection to the database system was done with ODBC. All time measurements discussed herein were taken on 32-bit CPUs over the course of several years. Therefore, they cannot be compared with each other and they should only be used to understand performance gains within the same organization. The first organization was an insurance company. The data mining goal involved segmenting customers into tiers according to their profitability based on demographic data, billing information and claims. The statistical techniques used to determine segments involved histograms and clustering. The final data set had about n = 300k records and d = 25 variables. There were four segments, categorizing customers from best to worst. The second organization was a cellular telephone service provider. The data mining task involved predicting which
  • 60. customers were likely to upgrade their call service package or purchase a new handset. The default technique was logistic regression [7] with stepwise procedure for variable selection. The data set used for scoring had about n = 10M records and d = 120 variables. The predicted variable was binary. The third organization was an Internet Service Provider (ISP). The predictive task was to detect which customers were likely to disconnect service within a time window of a few months, based on their demographic data, billing information and service usage. The statistical techniques used in this case were decision trees and logistic regression and the predicted variable was binary. The final data set had n = 3.5M records and d = 50 variables. C. Database system-centric big data analytics workflows: Users Opinion We summarize pros and cons about running big data analyt- ics workflows inside and outside the database system. Table I contains a summary of outcomes. As we can see performance to score data sets and transforming data sets are positive
  • 61. outcomes in every case. Building the models faster turned out not be as important because users relied on sampling to build models and several samples were collected to tune and test models. Since all databases and servers were within the same firewall security was not a major concern. In general, improving data management was not seen as major concern because there existed a data warehouse, but users acknowledge a “distributed” analytic environment could be a potential management issue. We now summarize the main objections, despite the advantages discussed above. We exclude cost as a decisive factor to preserve anonymity of users opinion and give an unbiased discussion. First, many users preferred a traditional programming language like Java or C++ instead of a set-oriented language like SQL. Second, some specialized techniques are not available in the database system due to their mathematical complexity; relevant examples include Support Vector Machines, Non-linear regression and time series mod- els. Finally, sampling is a standard mechanism to analyze large
  • 62. data sets. D. Workflow Processing Time Comparison We compare workflow processing time inside the database system using SQL and UDFs and outside the database system, using an external server transforming exported data sets and computing models. In general, the external server ran existing data transformation programs developed by each organization. On the other hand, each organization had diverse data mining tools that analyzed flat files. We must mention the comparison is not fair because the database system server was in general a powerful parallel computer and the external server was a smaller computer. Nevertheless, such setting represents a com- mon IT environment where the fastest computer is precisely the database system server. We discuss tables from the database in more detail. There were several input tables coming from a large normalized database that were transformed and denormalized to build data sets used by statistical or machine learning techniques.
  • 63. In short, the input were tables and the output were tables as well. No data sets were exported in this case: all processing happened inside the database system. On the other hand, analysis on the external server relied on SQL queries to extract data from the database system, transform the data to produce data sets in the statistical server, and then build models or score data sets based on a model. In general, data extraction from the database system was performed using bulk utilities, which exported data records in blocks. Clearly, there was an export bottleneck from the database system to the external server. Table II compares performance between both workflow processing alternatives: inside and outside the database system. As introduced in Section II, we distinguish two big data analytics workflows: computing the model and scoring the model on large data sets. The times shown in Table II include the time to transform the data set with joins and aggregations, the time to compute the model, and the time to score applying the best model. As we can see, the database system is significantly faster. We must mention that to build the predictive models both approaches exploited samples from a large data set; the models were then tuned with further samples. To score data sets the gap is wider, highlighting the efficiency of SQL in computing joins and aggregations to build the data set and then computing mathematical equations on the data set.

TABLE I
ADVANTAGES AND DISADVANTAGES OF RUNNING BIG DATA ANALYTICS WORKFLOWS INSIDE A DATABASE SYSTEM.
                                          Insur. Phone ISP
Advantages:
Improve Workflow Management               X X X
Accelerate Workflow Execution             X
Prepare data sets more easily             X X X
Compute models faster without sampling    X
Score data sets faster                    X X X
Enforce database security                 X X
Disadvantages:
Workflow output independent               X X
Workflow outside database system OK       X X
Prefer program. lang. over SQL            X X
Samples accelerate modeling               X X
Database system lacks statistical models  X
Legacy code                               X

TABLE II
WORKFLOW PROCESSING TIME INSIDE AND OUTSIDE A DATABASE SYSTEM (TIME IN MINUTES).
Task                   outside  inside
Build model:
  Segmentation               2       1
  Predict propensity        38       8
  Predict churn            120      20
Score data set:
  Segmentation               5       1
  Predict propensity       150       2
  Predict churn             10       1

TABLE III
TIME TO COMPUTE MODELS INSIDE THE DATABASE SYSTEM AND TIME TO EXPORT DATA SET (SECS).
n × 1000    d   SQL/UDF   BULK    ODBC
  100       8       4       17     168
  100      16       5       32     311
  100      32       6       63     615
  100      64       8      121    1204
 1000       8      40      164    1690
 1000      16      51      319    3112
 1000      32      62      622    6160
 1000      64      78     1188   12010

In general,
  • 66. the main reason the external server was slower was the time to export data from the database system. A secondary reason was its more limited computing power. Table III compares processing time to compute a model inside the database system and the time to export the data set (with a bulk utility and ODBC). In this case the database system runs on a relatively small computer with 3.2 GHz, 4 GB on memory and 1 TB on disk. The models include PCA, Naive Bayes and linear regression, which can be derived from the correlation matrix [7] of the data set in a single table scan using SQL queries and UDFs. Exporting the data set is a bottleneck to perform data mining processing outside the database system, regardless of how fast the external server is. However, exporting a sample from the data set can be done fast, but analyzing a large data set without sampling is much faster to do inside the database system. Also, sampling can also be exploited inside the database system. V. RELATED WORK
  • 67. There exist many proposals that extend SQL with data mining functionality. Most proposals add syntax to SQL and optimize queries using the proposed extensions. Several techniques to execute aggregate UDFs in parallel are studied in [9]; these ideas are currently used by modern parallel relational database systems such as Teradata. SQL extensions to define, query and deploy data mining models are proposed in [11]; such extensions provide a friendly language interface to manage data mining models. This proposal focuses on man- aging models rather than computing them and therefore such extensions are complementary to UDFs. Query optimization techniques and a simple SQL syntax extension to compute multidimensional histograms are proposed in [8], where a multiple grouping clause is optimized. Some related work on exploiting SQL for data manipulation tasks includes the following. Data mining primitive operators are proposed in [1], including an operator to pivot a table and another one for sampling, useful to build data sets. The
  • 68. pivot/unpivot operators are extremely useful to transpose and transform data sets for data mining and OLAP tasks [3], but they have not been standardized. Horizontal aggregations were proposed to create tabular data sets [13], as required by statistical and machine learning techniques, combining pivoting and aggregation in one function. For the most part, research work on preparing data sets for analytic purposes in a relational database system remains scarce. Data quality is a fundamental aspect in a data mining application. Referential integrity quality metrics are proposed in [14], where users can isolate tables and columns with invalid foreign keys. Such referential problems must be solved in order to build a data set without missing information. Mining workflow logs, a different problem, has received attention [18]; the basic idea is to discover patterns in workflow process logs.
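To make the pivoting idea above more tangible, the following sketch emulates a horizontal aggregation with plain conditional aggregation rather than the dedicated operators or functions proposed in [3] and [13]; the table sales and the columns customer_id, quarter and amount are hypothetical.

      -- Sketch of a horizontal aggregation on an assumed table "sales":
      -- one row per customer and one column per quarter, the tabular layout
      -- expected by statistical and machine learning tools.
      SELECT customer_id,
             SUM(CASE WHEN quarter = 1 THEN amount ELSE 0 END) AS q1_amount,
             SUM(CASE WHEN quarter = 2 THEN amount ELSE 0 END) AS q2_amount,
             SUM(CASE WHEN quarter = 3 THEN amount ELSE 0 END) AS q3_amount,
             SUM(CASE WHEN quarter = 4 THEN amount ELSE 0 END) AS q4_amount
      FROM   sales
      GROUP  BY customer_id;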
  • 69. To the best of our knowledge, there is scarce research work dealing with migrating data preprocessing into a database system to improve the management and processing of big data analytics workflows.
        VI. CONCLUSIONS
        We presented practical issues and discussed common solutions to push data preprocessing into a database system to improve the management and processing of big data analytics workflows. It is important to emphasize that data preprocessing is generally the most time-consuming and error-prone task in a big data analytics workflow. We identified specific data preprocessing issues. Summarization generally has to be done at different granularity levels, and such levels are generally not hierarchical. Rows are selected based on a time window, which requires indexes on date columns. Row selection (filtering) with complex predicates happens on many tables, making code maintenance and query optimization difficult. In general, it is necessary to create a "universe" data set to define left outer joins, as sketched below.
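A minimal sketch of such a universe data set follows, assuming hypothetical tables customer, orders and payments and an arbitrary cutoff date: the customer table defines the universe, the left outer joins preserve customers without activity, and the time-window predicates benefit from indexes on the date columns.

      -- Sketch of a "universe" data set (assumed tables customer, orders, payments):
      -- pre-aggregated derived tables avoid row fan-out, and every customer
      -- appears exactly once even with no orders or payments.
      SELECT c.customer_id,
             COALESCE(o.total_amount, 0) AS total_orders,
             COALESCE(p.num_payments, 0) AS num_payments
      FROM   customer c
             LEFT OUTER JOIN (SELECT customer_id, SUM(amount) AS total_amount
                              FROM   orders
                              WHERE  order_date >= '2016-01-01'   -- time window
                              GROUP  BY customer_id) o
               ON o.customer_id = c.customer_id
             LEFT OUTER JOIN (SELECT customer_id, COUNT(*) AS num_payments
                              FROM   payments
                              WHERE  payment_date >= '2016-01-01' -- time window
                              GROUP  BY customer_id) p
               ON p.customer_id = c.customer_id;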
  • 70. Model deployment requires importing models as SQL queries or UDFs to deploy a model on large data sets; a scoring sketch is given below. Based on experience from real-life projects, we compared advantages and disadvantages when running big data analytics workflows entirely inside and outside a database system. Our general observations are the following. Big data analytics workflows are easier to manage inside a database system. A big data analytics workflow is generally faster to run inside a database system assuming a data warehousing environment (i.e., the data originates in the database). From a practical perspective, workflows are easier to manage inside a database system because users can exploit the extensive capabilities of the database system (querying, recovery, security and concurrency control) and there is less data redundancy. However, external statistical tools may provide more flexibility than SQL and more statistical techniques. From an efficiency (processing time) perspective, transforming and scoring data sets and computing models are faster to develop and run inside the database system.
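As a sketch of the deployment step mentioned above, a fitted model can be scored directly with a SQL query on the large table, avoiding any export. The coefficients and the table universe_data_set (assumed to be the materialized result of the earlier universe sketch) are hypothetical placeholders, not an actual fitted model.

      -- Sketch: a linear model deployed as a plain SQL query (hypothetical names).
      SELECT customer_id,
             0.75                     -- assumed intercept
             + 0.10 * total_orders    -- assumed coefficient
             + 0.05 * num_payments    -- assumed coefficient
               AS churn_score
      FROM   universe_data_set;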
  • 71. Nevertheless, sampling represents a practical solution to accelerate big data analytics workflows running outside a database system. Improving and optimizing the management and processing of big data analytics workflows provides many opportunities for future work. It is necessary to specify models which take into account processing with external data mining tools. The data set tends to be the bottleneck in a big data analytics workflow, from both a programming and a processing time perspective. Therefore, it is necessary to specify workflow models which can serve as templates to prepare data sets. The role of the statistical model in a workflow needs to be studied in the context of the source tables that were used to build the data set; very few organizations reach that stage.
        REFERENCES
        [1] J. Clear, D. Dunn, B. Harvey, M.L. Heytens, and P. Lohman. Non-stop SQL/MX primitives for knowledge discovery. In Proc. ACM KDD Conference, pages 425–429, 1999.
  • 72. [2] J. Cohen, B. Dolan, M. Dunlap, J. Hellerstein, and C. Welton. MAD skills: New analysis practices for big data. In Proc. VLDB Conference, pages 1481–1492, 2009.
        [3] C. Cunningham, G. Graefe, and C.A. Galindo-Legaria. PIVOT and UNPIVOT: Optimization and execution strategies in an RDBMS. In Proc. VLDB Conference, pages 998–1009, 2004.
        [4] R. Elmasri and S.B. Navathe. Fundamentals of Database Systems. Addison-Wesley, 4th edition, 2003.
        [5] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11):27–34, November 1996.
        [6] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, 2nd edition, 2006.
        [7] T. Hastie, R. Tibshirani, and J.H. Friedman. The Elements of Statistical Learning. Springer, New York, 1st edition, 2001.
        [8] A. Hinneburg, D. Habich, and W. Lehner. COMBI-operator: database support for data mining applications. In Proc. VLDB Conference, pages 429–439, 2003.
        [9] M. Jaedicke and B. Mitschang. On parallel processing of aggregate and scalar functions in object-relational DBMS. In Proc. ACM SIGMOD Conference, pages 379–389, 1998.
  • 73. [10] T.M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.
        [11] A. Netz, S. Chaudhuri, U. Fayyad, and J. Bernhardt. Integrating data mining with SQL databases: OLE DB for data mining. In Proc. IEEE ICDE Conference, pages 379–387, 2001.
        [12] C. Ordonez. Statistical model computation with UDFs. IEEE Transactions on Knowledge and Data Engineering (TKDE), 22(12):1752–1765, 2010.
        [13] C. Ordonez and Z. Chen. Horizontal aggregations in SQL to prepare data sets for data mining analysis. IEEE Transactions on Knowledge and Data Engineering (TKDE), 24(4):678–691, 2012.
        [14] C. Ordonez and J. García-García. Referential integrity quality metrics. Decision Support Systems Journal, 44(2):495–508, 2008.
        [15] C. Ordonez and J. García-García. Database systems research on data mining. In Proc. ACM SIGMOD Conference, pages 1253–1254, 2010.
        [16] D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1999.
  • 74. [17] E. Rahm and H.H. Do. Data cleaning: Problems and current approaches. IEEE Bulletin of the Technical Committee on Data Engineering, 23(4), 2000.
        [18] W.M.P. van der Aalst, B.F. van Dongen, J. Herbst, L. Maruster, G. Schimm, and A.J.M.M. Weijters. Workflow mining: A survey of issues and approaches. Data & Knowledge Engineering, 47(2):237–267, 2003.