A Comparative Study of ETL Tools
Sana Yousuf
Department of Computer Science
Military College of Signals, National University of
Sciences & Technology
Islamabad, Pakistan
sn_ysf@yahoo.com
Sanam Shahla Rizvi
Department of Computer Science
Military College of Signals, National University of
Sciences & Technology
Islamabad, Pakistan
ssrizvi@mcs.edu.pk
Abstract—In many organizations valuable data is wasted
because it lies scattered across different formats and various
resources. Data warehouses (DWs) are complex systems that
consolidate data with the objective of assisting knowledge
workers in the decision making process. The key components of
DWs are the Extraction-Transformation-Loading (ETL)
processes. Since incorrect or misleading data may produce
wrong decisions, the selection of an appropriate ETL tool for a
DW is necessary to improve data quality. This selection is a
complex and important issue in data warehousing because the
ETL tool largely determines the quality of a data warehouse.
This paper first highlights the ETL process briefly, then
discusses some of the available ETL tools along with general
criteria used as measuring parameters for selecting
appropriate ETL tools. Finally, an analysis of the tools based
on the generalized criteria is presented to give an insight into
which tool is better for which circumstance.
Keywords: Data warehouses, ETL tools, complex systems,
enterprise systems
I. INTRODUCTION
A data warehouse is a large data repository that
consolidates various types of data transformed into a single
suitable format. Depending on specific business needs it can
be architected differently. In general, however, data stored in
operational databases is transferred to a data warehouse
pre-processing platform, also known as the staging area; then,
after processing, it moves into the data warehouse, and lastly
it is transformed into sets of conformed data marts.
A. ETL Process and Concepts
Extract, Transform and Load (ETL) is an important
component of the data warehousing architecture. The
process includes extraction of data from various data
sources, transformation of the extracted data according to
business requirements, and loading of that data into the
data warehouse.
Any programming language can be used to build an ETL
process; however, building it from bits and pieces is quite
complex. Various ETL tools are available in the market,
allowing an enterprise to select one based on its requirements
and needs. With the passage of time these tools have matured
and now provide much more than just extraction,
transformation and loading of data. The improvements
include capabilities such as “data profiling, data quality
control, monitoring and cleansing, real-time and on-demand
data integration in a service oriented architecture, and
metadata management” [12]. Moreover, ETL tools are now
customizable according to the functional requirements of an
enterprise data warehouse.
a) Extraction
Being the first step in the ETL process, its focus is on
extracting data from different source systems. These systems
could be internal, external, structured or unstructured, i.e. of
any type. Thus source systems could be mainframe
applications, flat files, ERP applications, relational databases,
non-relational databases, CRM tools or even message queues.
These sources may hold data in different formats, i.e.
different internal representations, making extraction a
difficult process. So an extraction tool should be able to:
- Understand all the different data storage formats
- Communicate with various relational databases
- Read and understand different file formats used in an
organization
- Extract only relevant data before bringing it into
the DW
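The unification of heterogeneous sources described above can be sketched as follows: a minimal Python illustration, assuming two hypothetical sources (an inline CSV flat file and an in-memory SQLite table) that are both brought into one common row representation for the staging area.

```python
import csv
import io
import sqlite3

def extract_csv(text):
    """Extract rows from flat-file (CSV) content into a common dict format."""
    return list(csv.DictReader(io.StringIO(text)))

def extract_sqlite(conn, table):
    """Extract rows from a relational source into the same dict format."""
    conn.row_factory = sqlite3.Row
    cur = conn.execute(f"SELECT * FROM {table}")
    return [dict(row) for row in cur.fetchall()]

# A flat-file source and a relational source, unified into one representation.
csv_rows = extract_csv("id,name\n1,Alice\n2,Bob\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("INSERT INTO customers VALUES (3, 'Carol')")
db_rows = extract_sqlite(conn, "customers")

staging = csv_rows + db_rows  # three records, one uniform format
```

A real extraction tool would additionally filter out irrelevant columns and rows at the source, as the last bullet above requires, rather than copying everything into the staging area.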
b) Transformation
The transformation phase ensures data consistency
and performs data cleansing before loading data into the data
warehouse. In order to transform the data properly, a number
of rules and business calculations are applied to the extracted
data so that different data formats are mapped into a single
format. Transformation can be integrated with the extraction
or loading phase depending upon when it is performed.
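The rule-based mapping described above can be sketched in Python; the specific rules here (two hypothetical source date formats mapped to one target format, plus a trim-and-case cleansing rule) are illustrative assumptions, not rules from any particular tool.

```python
from datetime import datetime

# Hypothetical rules mapping two source date formats into one target format.
SOURCE_FORMATS = ["%d/%m/%Y", "%Y-%m-%d"]

def normalize_date(value):
    """Try each known source format and emit the single warehouse format."""
    for fmt in SOURCE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y%m%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value}")

def transform(row):
    """Apply cleansing (trim, case) and format mapping to one extracted row."""
    return {
        "name": row["name"].strip().title(),              # cleansing rule
        "order_date": normalize_date(row["order_date"]),  # mapping rule
    }

clean = transform({"name": "  alice smith ", "order_date": "31/12/2009"})
# clean == {"name": "Alice Smith", "order_date": "20091231"}
```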
c) Loading
After transforming and cleansing the extracted data, it is
loaded into fact and dimension tables of the data warehouse
to be used for various analytical purposes. Loading is done
regularly to avoid unprocessed data piling up. It can be
required in one of two situations:
- Load the new data that is currently contained in the
operational database
- Load the updates corresponding to the changes that
occurred in the operational database
“Reference [3] states that incremental loading is the
preferred approach to data warehouse refreshment because it
generally reduces the amount of data that has to be extracted,
transformed, and loaded by the ETL system. ETL jobs for
incremental loading require access to source data that has
been changed since the previous loading cycle. For this
purpose, so called Change Data Capture (CDC) mechanisms
at the sources can be exploited, if available. Additionally,
ETL jobs for incremental loading potentially require access
to the overall data content of the operational sources.”
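The incremental loading cycle quoted above can be sketched as follows; this is a simplified stand-in for a CDC mechanism, assuming (hypothetically) a monotonically increasing row id as the change marker and SQLite tables in place of real operational and warehouse databases.

```python
import sqlite3

def incremental_load(source, target, last_loaded_id):
    """Load only rows changed since the previous cycle (a simple CDC
    stand-in: the row id serves as the change marker/watermark)."""
    rows = source.execute(
        "SELECT id, amount FROM sales WHERE id > ?", (last_loaded_id,)
    ).fetchall()
    target.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
    return max((r[0] for r in rows), default=last_loaded_id)

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
source.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 10.0), (2, 20.0)])

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE fact_sales (id INTEGER, amount REAL)")

watermark = incremental_load(source, target, 0)          # first cycle: 2 rows
source.execute("INSERT INTO sales VALUES (3, 30.0)")     # a change arrives
watermark = incremental_load(source, target, watermark)  # second cycle: 1 row
```

Only the one new row crosses the network on the second cycle, which is exactly the reduction in extracted, transformed and loaded data that reference [3] attributes to incremental loading.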
The paper provides an insight into the background of ETL
tools in the following section. Section III presents a brief
overview of various ETL tools. Section IV focuses on setting
the criteria to rank available tools. Section V presents a
comparative analysis of various tools. The paper ends with a
conclusion of the overall study in section VI.
II. BACKGROUND OF ETL TOOLS
An ETL tool must provide a certain set of basic ETL
processing facilities, as explained in section I, to rank as a
proper ETL tool. Since 2003 Passionned, a consultancy and
research firm, has been closely monitoring the market for
both ETL and data integration tools [4]. Earlier surveys were
based on the main market-driving entities, also known as
visionaries. Many organizations used to assume that they had
automatically made the right choice if they purchased a tool
from one of the market leaders. However, the trend changed
over time, and organizations started building ETL tools
themselves according to their own requirements.
Since the late nineties, all the major business intelligence
(BI) vendors have purchased or developed their own ETL
tools. BI tools had more reliable ETL processes and a well
designed method of maintaining the data warehouse. BI
provided a better solution, but the ETL part consumed
70-80% of the costs involved in a successful BI system.
In its ETL Tools survey 2009, Passionned described the
importance of evaluating and promoting ETL tools, because
many organizations still built their data warehouses by hand,
i.e. by writing complex PL/SQL or SQL and stored
procedures. The surveyors' point was that developer
productivity would increase by a factor of 3-5 if a proper
ETL tool were used. Thus, if proper guidance were available
to enterprises, choosing the right product would become
easier and less risky for the organization itself. As explained
by reference [5], construction of data warehouses through
ETL tools resulted in a better, more stable and more reliable
data warehouse that allowed more aspects to be checked and
monitored in relation to each other. Companies on their own
official websites also present comparisons of their offered
products with other market competitors; Adeptia [10],
Microsoft SSIS and Informatica [3] are such examples.
III. SOME FAMOUS ETL TOOLS
Some famous ETL tools available in the market are as follows:
A. Pentaho Data Integration
Pentaho [12] is a commercial open-source business
intelligence suite with a data integration product named
Kettle. Using an innovative metadata-driven approach, it is
fast and has an easy to use GUI. Having started in 2001, it
has grown and today has a strong community of 13,500
registered users. It also supports multi-format data and
allows data movement between many different databases and
files.
B. Talend Open Studio
Talend Open Studio (TOS) [10] is another open-source
tool with support for data integration. Started in 2006, it has
a smaller community of followers but still holds quite a
market share, as two of its supporters are finance companies.
Rather than being metadata driven it uses a code-driven
approach and has a GUI for user interaction. The code
generation property allows generating executable Java and
Perl code that can later be run on a server.
C. Informatica Power Center
Informatica Power Center (IPC) [3] is not open-source
software but a commercially recommended data integration
suite, and thus the market share leader in data integration
tools. Founded in 1993, Informatica has made its place in the
market with consistency and leadership; today it has 2600
registered customers, of which 100 are listed stock exchange
companies. The main focus of IPC is data integration, with
numerous capabilities, e.g. enterprise-scale architecture, data
cleansing, data profiling, web services and interoperability
with current and legacy systems.
D. Inaplex Inaport
Inaplex [12] provides mid-market solutions focusing on
customer relationship management for customer data
integration. Besides customer relationship management, it
also lays emphasis on providing simple solutions for data
integration and accountancy handling.
E. Oracle Warehouse Builder
The Oracle Warehouse Builder (OWB) [13] is “a
comprehensive tool for ETL, relational and dimensional
modeling, data quality, data auditing, and full lifecycle
management of data and metadata” [13]. It achieves high
performance, security and scalability by using the Oracle
database as its metadata repository and transformation engine.
F. IBM Information Server
IBM Information Server (DataStage) [10] is a product
well known for its services. The capabilities of the tool
include data consolidation, synchronization and distribution
across disparate databases; automatic data profiling and
analysis in terms of content and structure; data quality
enhancement; and transformation and delivery to and from
complex sources, i.e. the capability to get data from any
source format and deliver it to any target, within or outside
the enterprise, at the right time.
It also allows integration and information access for
diverse data and content regardless of where the data resides.
With the data replication services, customer information
management can be done quickly.
G. Microsoft SQL Server Integration Services
Microsoft SQL Server Integration Services (MS SSIS)
[14] allows run-time data transfer and management.
Designed for enterprise-wide application support, it provides
a platform for performing ETL functions and for creating
and controlling data packages. It allows the formation of
script applications using .NET platform support, increased
scalability with thread pooling, and a more advanced import
and export wizard. It also allows customization of packages
to suit specific organizational needs, usage of digital
signatures for security, and support for service oriented
architecture.
IV. ETL TOOL FEATURES
With the available span of functionality and quite a
number of ETL tool vendors, it is quite difficult to rank the
whole variety of tools, as every tool has some special
features too. Some generic behaviour has been identified by
[5], on the basis of which the following comparison and
graphs are made.
The following general aspects can be kept in mind when
evaluating an ETL tool.
A. Architecture
For evaluating any tool with respect to architecture,
aspects such as support for parallel processing, symmetric
multiprocessing, massive multiprocessing, clustering, load
balancing and feasibility for grid computing should be
considered. Support for multi-user management of ETL
processes running on multiple machines, and support for a
common meta-model, i.e. allowing the exchange of metadata
with the same brand and other brands, should be considered
too.
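To make the parallel-processing criterion concrete, the following minimal Python sketch partitions rows and transforms the partitions concurrently, roughly as an engine with parallel-processing support might; the worker function and data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def transform_partition(rows):
    """One ETL worker: transform a partition of rows independently."""
    return [{"name": r["name"].upper()} for r in rows]

def parallel_transform(rows, workers=4):
    """Split rows into partitions and transform them in parallel,
    then reassemble the results in order."""
    size = max(1, len(rows) // workers)
    parts = [rows[i:i + size] for i in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(transform_partition, parts)
    return [row for part in results for row in part]

data = [{"name": f"user{i}"} for i in range(8)]
out = parallel_transform(data)
```

Because the partitions share no state, the same pattern extends to symmetric multiprocessing or clustered execution, which is what the criterion above probes for.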
B. Functionality
Two main aspects relating to the functionality of an ETL
tool are important: the metadata support and the overall
functionality provided by the tool.
The main functionality question is whether the tool is data
cleansing oriented or data transformation oriented, or
performs both equally. Thus one gets a clear picture of which
tool to select depending on the nature of the data that shall be
put into the tool. Support for a direct connection to the data
source for input is also an important aspect of functionality.
On the other hand, support for metadata is a key aspect
too. An ETL tool is also responsible for using metadata to
map source data to the destination. Thus choosing a tool that
conforms to the organization's metadata strategy is very
important.
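The metadata-driven mapping of source data to destination mentioned above can be sketched as follows; the column catalog here is a hypothetical example, not any tool's actual metadata format.

```python
# A hypothetical metadata catalog mapping source columns to warehouse columns.
MAPPING_METADATA = {
    "cust_nm": "customer_name",
    "ord_dt": "order_date",
    "amt": "amount",
}

def apply_metadata(row, mapping):
    """Rename source fields to destination fields via the metadata catalog;
    source columns absent from the catalog are dropped, not loaded blindly."""
    return {dest: row[src] for src, dest in mapping.items() if src in row}

source_row = {"cust_nm": "Alice", "ord_dt": "20091231", "amt": 42.5, "tmp": 1}
warehouse_row = apply_metadata(source_row, MAPPING_METADATA)
```

Keeping the mapping in data rather than code is what lets a tool conform to an organization's metadata strategy: the catalog can be exchanged, versioned and audited independently of the ETL jobs.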
C. Usability
Usability is one of the important factors of any tool.
Points to consider are that the tool should be easy to use,
easy to understand and fast to get used to. In this regard, the
aspects of concern are that the tool should have a well
balanced interface and must support the typical task
sequence of any ETL usage.
D. Reusability
Reusability requires that the components of a data
warehouse architecture constructed using the ETL tool be
reusable and able to handle parameters. The tool should be
capable of dividing the process into small building blocks,
allow the user to define functions, and allow these functions
to be used in the process flow.
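The building-block idea above can be sketched as a composable pipeline in Python; the individual blocks (trimming, tagging, a user-defined function) are illustrative assumptions.

```python
def make_pipeline(*steps):
    """Compose small reusable building blocks into one process flow."""
    def run(rows):
        for step in steps:
            rows = [step(r) for r in rows]
        return rows
    return run

# Reusable blocks, one of them parameterized...
def trim_fields(row):
    return {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}

def tag_source(name):              # parameterized block
    def step(row):
        return {**row, "source": name}
    return step

# ...plus a user-defined function slotted into the same flow.
def uppercase_name(row):
    return {**row, "name": row["name"].upper()}

pipeline = make_pipeline(trim_fields, tag_source("crm"), uppercase_name)
out = pipeline([{"name": " alice "}])
# out == [{"name": "ALICE", "source": "crm"}]
```

Each block can be reused in other flows or swapped out without touching the rest, which is precisely the reusability property the criterion rewards.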
E. Connectivity
The main aspects to consider include the native
connections the tool supports, the packages it can read
metadata from, the types of message queuing products the
tool can connect to, the capability to graphically join tables,
support for the changed data capture principle, transformation
matching and address cleansing ability, as well as options for
data profiling such as uniqueness and distribution.
F. Interoperability
Last but not least, the tool should be capable of running
on a number of platforms and also on different versions of a
product.
V. ANALYSIS OF ETL TOOLS
With all the aspects discussed in section IV in mind, an
analysis of the services provided by the tools is discussed
hereafter. Thus, in choosing any tool, its respective aspects
should be considered. The following graph based analysis
provides support for decision making. For this analysis,
various websites, vendor white papers, web blogs,
comparisons and previous surveys were consulted, and the
analysis was conducted based on the basic set of features
discussed in section IV.
Each of the ETL tools discussed in section III is graded
on the basis of points according to the level of services
supported, while the vendors are depicted by acronyms in
the graphs instead of full names.
A. Architectural Aspects
Based on support for enterprise architecture, clustering,
data separation into groups, web based application interface
support and cloud computing deployment support, the
following graph depicts the current services supported by the
tools.
Thus IPC and OWB lead in architectural support, with
SSIS coming up right behind.
B. ETL Functionality
Points have been given depending upon the completeness
of the tools in terms of functionality. Thus support for data
cleansing, transformation, integration services and a common
metadata model are the main aspects considered. The graph
is drawn by adding up the points granted to each tool
depending upon the support it provides, i.e. one point for
each aspect, and then summing the points that fall into one
category. The same was done for both trends, i.e. basic
functionalities in 2007 and improvements until 2010.
[Bar chart omitted: points (0-35) per vendor (IBM IS, IPC, Talend OS, OWB, MS SSIS, BO SAP, SAS DIS, Others) for web-based UI, clustering and job distribution, SOA enablement, and cloud deployment option.]
Figure 1. Architectural Support
[Bar chart omitted: ETL functionality points (0-50) per vendor (IBM IS, IPC, Talend OS, OWB, MS SSIS, BO SAP, SAS DIS, Others), showing 2007 scores and improvement by 2010.]
Figure 2. Functionality
C. Usability
This graph covers all the points graded to a tool on the
basis of an easy to use, well designed and balanced interface.
What-you-see-is-what-you-get (WYSIWYG) behaviour and
task compatibility are further bases of the grade. Each point
graded is accumulated by the existence of a subset of
services necessary for ease of use and understanding. The
ease of training new users to become used to the interface is
also part of the criterion.
D. Reusability
The following graph depicts a comparison and point
grading on the basis of the reusability factors supported: the
capability of data stream splitting, automatic documentation,
and support for defining user-defined functions and using
them in the process flow.
[Bar chart omitted: ease-of-use points (0-8) per vendor (IBM IS, IPC, Talend OS, OWB, MS SSIS, BO SAP, SAS DIS, Others), showing original 2007 scores and improvement by 2010.]
Figure 3. Usability
[Bar chart omitted: reusability points (0-40) per vendor (IBM IS, IPC, Talend OS, OWB, MS SSIS, BO SAP, SAS DIS, Others) for reusable service repository, split data streams, data partitioning, and automatic documentation.]
Figure 4. Reusability
E. Connectivity
Connectivity, as the name indicates, is calculated by
aggregating the points granted to a tool on the following
aspects: the total number of sources that can be read without
any additional middleware, the enterprise applications
supported by the tool, the platforms it can run on, and last
but not least the support for messaging (i.e. real-time data
handling).
F. Interoperability
The support for various platforms is provided in detail in
the following graph. Here all Windows and Linux versions
are each considered as one, while UNIX versions are catered
for separately.
[Bar chart omitted: connectivity points (0-100) per vendor (IBM IS, IPC, Talend OS, OWB, MS SSIS, BO SAP, SAS DIS, Others) for platforms, data sources, packages and messages.]
Figure 5. Connectivity
[Bar chart omitted: interoperability points (0-100) per vendor (IBM IS, IPC, Talend OS, OWB, MS SSIS, BO SAP, SAS DIS, Others) for platform support: Windows, Linux, Sun Solaris, HP-UX, IBM AIX, IBM iSeries OS/400, IBM zSeries MVS, HP Tru64 and OpenVMS.]
Figure 6. Interoperability
From all the analysis conducted, it is still hard to
generalize which tool is the best. Though Informatica proves
better in quite a few features, MS SSIS and OWB have
improved well over time and now keep pace with the high
contenders too. Overall, it can be seen that when considering
pure ETL tools, IPC can still be ranked as the market leader,
with IBM IS coming second alongside Talend OS. However,
when it comes to DB-integrated tools, OWB and SSIS follow
IPC directly. Thus one should be careful in selecting a tool,
as it may not be the best for the organization just by the
name of the vendor. The capabilities of the tool should be
reviewed before selection.
VI. CONCLUSION
Important data in most organizations is underutilized just
because it exists in different formats and various resources.
Data warehouses (DWs) are complex systems having
consolidated data with the main objective of assisting
knowledge workers in the decision making process. The key
components of DWs are the Extraction-Transformation-
Loading (ETL) processes. The goal of this paper is to
elaborate the ETL process and its importance to data
warehouses, and to provide a comparison based on some
generalized criteria to find the suitability of a tool for a
certain category of consumers. The paper provides a brief
overview of the ETL tools available in the market, specifies
some key points for generalizing the capabilities provided by
a tool, and uses graph based analysis on a grade point scale
to grade the specific tools selected. All this provides a
comparison of the available tools in terms of the features
they provide, helping an organization choose which tool will
best suit its needs.
REFERENCES
[1] T.Y. Wah, H. Peng, and C.S. Hok, “Building Data Warehouse,” Proc. 24th South East Asia Regional Computer Conference, Bangkok, Thailand, November 18-19, 2007.
[2] M.N. Tho and A.M. Tjoa, “Zero-Latency Data Warehousing for Heterogeneous Data Sources and Continuous Data Streams,” Institute of Software Technology and Interactive Systems, 2003.
[3] T. Jörg and S. Dessloch, “Near Real-Time Data Warehousing Using State-of-the-Art ETL Tools,” University of Kaiserslautern, Kaiserslautern, Germany, 2009.
[4] Passionned, “The BI Tools survey report,” 2008.
[5] Passionned, “ETL Tools survey report,” 2009.
[6] J. Levin, “ETL Tools Comparison,” March 2008.
[7] R. Chillar and B. Kochar, “Extraction Transformation Loading: A Road to Data Warehouse,” 2nd National Conference on Mathematical Techniques: Emerging Paradigms for Electronics and IT Industries.
[8] Guide to Data Warehousing and Business Intelligence, available at http://data-warehouses.net/architecture/etlprocess.html.
[9] Pervasive Systems, “Extraordinarily Flexible ETL Platform,” available at http://www.pervasiveintegration.com/scenarios/Pages/etl_tools_data_aggregation.aspx.
[10] Adeptia Inc., “ETL Vendors Comparison,” available at http://www.adeptia.com/products/etl_vendor_comparison.html.
[11] Guide to Data Warehousing and Business Intelligence, “Architectural Overview,” available at http://data-warehouses.net/architecture/overview.html.
[12] ETL Tools Survey, available at http://www.etltool.com/what-is-etl.htm.
[13] Oracle Warehouse Builder 11g, “A Technical Overview,” available at http://www.oracle.com/technology/products/warehouse/index.html.
[14] ETL Data Warehouse Concepts, available at http://etl-information.blogspot.com/2007_07_01_archive.htm.