Real-Time Data Warehouse Loading Methodology
Ricardo Jorge Santos
CISUC – Centre of Informatics and Systems
DEI – FCT – University of Coimbra
Coimbra, Portugal
[email protected]
Jorge Bernardino
CISUC, IPC – Polytechnic Institute of Coimbra
ISEC – Superior Institute of Engineering of Coimbra
Coimbra, Portugal
[email protected]
ABSTRACT
A data warehouse provides information for analytical processing, decision making and data mining tools. As the concept of the real-time enterprise evolves, the required synchronism between transactional data and statically implemented data warehouses has been redefined. Traditional data warehouse systems have static schema structures and relationships between data, and are therefore unable to support any dynamics in their structure and content. Their data is only updated periodically because they are not prepared for continuous data integration. For real-time enterprises with decision support needs, real-time data warehouses seem very promising. In this paper we present a methodology for adapting data warehouse schemas and user-end OLAP queries to efficiently support real-time data integration. To accomplish this, we use techniques such as table structure replication and query predicate restrictions for selecting data, enabling data to be loaded continuously into the data warehouse with minimum impact on query execution time. We demonstrate the efficiency of the method by analyzing its impact on query performance using the TPC-H benchmark, executing query workloads while simultaneously performing continuous data integration at various insertion time rates.
Keywords
Real-time and active data warehousing, continuous data integration for data warehousing, data warehouse refreshment and loading process.
1. INTRODUCTION
A data warehouse (DW) provides information for analytical
processing, decision making and data mining tools. A DW
collects data from multiple heterogeneous operational source
systems (OLTP – On-Line Transaction Processing) and stores
summarized integrated business data in a central repository used
by analytical applications (OLAP – On-Line Analytical
Processing) with different user requirements. The data area of a
data warehouse usually stores the complete history of a
business.
The common process for obtaining decision making information
is based on using OLAP tools [7]. These tools have their data
source based on the DW data area, in which records are updated
by ETL (Extraction, Transformation and Loading) tools. The
ETL
processes are responsible for identifying and extracting the
relevant data from the OLTP source systems, customizing and
integrating this data into a common format, cleaning the data
and
conforming it into an adequate integrated format for updating
the
data area of the DW and, finally, loading the final formatted
data
into its database.
Traditionally, it has been well accepted that data warehouse databases are updated periodically – typically on a daily, weekly or even monthly basis [28] – implying that their data is never fully up-to-date: the OLTP records saved between two consecutive updates are not yet included in the data area and are therefore excluded from the results supplied by OLAP tools. Until recently, using periodically updated data was not a crucial issue.
However, in areas such as e-business, stock brokering, online telecommunications, and health systems, for instance, relevant information needs to be delivered as fast as possible to the knowledge workers or decision systems that rely on it to react in a near real-time manner, according to the newest data captured by an organization's information system [8]. This makes supporting near real-time data warehousing (RTDW) a critical issue for such applications.
The demand for fresh data in data warehouses has always been a
strong desideratum. Data warehouse refreshment (integration of
new data) is traditionally performed in an off-line fashion. This
means that while processes for updating the data area are
executed, OLAP users and applications cannot access any data.
This set of activities usually takes place in a preset loading time
window, to avoid overloading the operational OLTP source
systems with the extra workload of this workflow. Still, users
are
pushing for higher levels of freshness, since more and more
enterprises operate in a business time schedule of 24x7. Active
Data Warehousing refers to a new trend where DWs are updated
as frequently as possible, due to the high demands of users for
fresh data. For that reason, the term is also designated as Real-Time Data Warehousing in [24]. The conclusions presented by T. B. Pedersen in a report from a knowledge exchange network formed by several major technological partners in Denmark [17] indicate that all partners agree that the real-time enterprise and continuous data availability are a short-term priority for many business and data-intensive enterprises. Nowadays, IT
managers are facing crucial challenges deciding whether to
build
a real-time data warehouse instead of a conventional one and
whether their existing data warehouse is going out of style and
needs to be converted into a real-time data warehouse to remain
competitive. In some specific cases, data update delays larger
than
a few seconds or minutes may jeopardise the usefulness of the
whole system.
In a nutshell, accomplishing near zero latency between OLTP and OLAP systems consists in ensuring continuous data integration from the former into the latter. To make this feasible, several issues need to be taken into consideration: (1) operational OLTP systems are designed to meet well-specified (short) response time requirements aiming for maximum system availability, which means that a RTDW scenario must cope with the overhead it imposes on those OLTP systems; (2) the tables of a data warehouse's database directly related to transactional records (commonly named fact tables) are usually huge, so adding new data and the consequent procedures, such as index updating or referential integrity checks, would certainly impact OLAP performance and data availability. Our work focuses on the DW perspective; for that reason we present an efficient methodology for continuous data integration covering the loading step of ETL.
This paper presents a solution that enables efficient continuous data integration in data warehouses while allowing OLAP queries to execute simultaneously with minimum performance decrease. With this, we seek to minimize the delay between the recording of transactional information and its reflection in the decision support database. If operational data extraction and transformation can be performed without significant delay during the transaction itself, this solution will load all the decision support information needed into the data warehouse in useful time, allowing an answer while the transaction is still occurring. This is the general concept of real-time data warehousing. The issues addressed in this paper concern the DW end of the system, namely how to perform the loading step of the ETL procedures and how to use the DW's data area to efficiently support continuous data integration. The extraction and transformation of operational (OLTP) source system data are not the focus of this paper. Therefore, in the remainder of the paper we always take the data warehouse's point of view.
Based on the existing schema(s) of the DW's database, our methodology consists in creating a replica of each of its tables, empty of contents and without defining indexes, primary keys or any other kind of restrictions or constraints. In this new schema, the replicated tables receive and record the data coming from the staging area, which is loaded continuously. Because they are initially created empty and have no constraints of any sort, the time and resources consumed by the procedures inherent to data integration are considerably minimized. As data integration proceeds, the performance of queries that include the newest data deteriorates, due to the lack in the replicated tables of the usual data structures that could optimize access (such as indexes). When the data warehouse's performance is considered unacceptable by its users or administrators, the data accumulated in the replicated tables is used to update the original schema. After this update, the replicated tables are recreated empty of contents, regaining maximum performance. We also demonstrate how to adapt OLAP queries to the new schema, to take advantage of the most recent data, which is integrated in real time.
The experiments presented here were performed using a standard benchmark for decision support systems, TPC-H, from the Transaction Processing Performance Council [22]. Several configurations were tested, varying items such as the database's physical size, the number of simultaneous users executing OLAP queries, the amount of available RAM and the time rates between transactions.
The remainder of this paper is organized as follows. In Section 2, we describe the requirements for real-time data warehousing. Section 3 presents background and related work in real-time data warehousing. Section 4 explains our methodology, and in Section 5 we present the experimental evaluation of our methods and demonstrate their functionality. The final section contains concluding remarks and future work.
2. REQUIREMENTS FOR REAL-TIME DATA WAREHOUSING
Nowadays, organizations generate large amounts of data, at a rate which can easily reach several megabytes or gigabytes per day. Therefore, as time goes by, a business data warehouse can easily grow to terabytes or even petabytes. The size of data warehouses implies that each query executed against the data area usually accesses a large number of records and performs operations such as joins, sorting, grouping and calculations.
To optimize these accesses, the data warehouse uses predefined internal data structures (such as indexes or partitions, for instance), which are also large and considerably complex. These facts make it very difficult to efficiently update the data warehouse's data area in real time: propagating transactional data as it arrives would most likely overload the server, given its update frequency and volume; it would involve many complex operations on the data warehouse's data structures and dramatically degrade OLAP performance.
In a nutshell, real-time data warehouses aim to decrease the time it takes to make decisions and try to attain zero latency between cause and effect for those decisions, closing the gap between intelligent reactive systems and business processes. Our aim is to transform a standard DW, which uses batch loading during update windows (during which analytical access is not allowed), into a near zero latency analytical environment providing current data, in order to enable (near) real-time dissemination of new information across an organization. The business requirements for this kind of analytical environment introduce a set of service level agreements that go beyond what is typical in a traditional DW. The major issue is how to enable continuous data integration while ensuring minimal negative impact on the main features of the system, such as the availability and response time of both OLTP and OLAP systems.
An in-depth discussion of these features from the analytical
point
of view (to enable timely consistent analysis) is given in [6].
Combining highly available systems with active decision
engines
allows near real-time information dissemination for data
warehouses. Cumulatively, this is the basis for zero latency
analytical environments [6]. The real-time data warehouse
provides access to an accurate, integrated, consolidated view of
the organization’s information and helps to deliver near real-
time
information to its users. This requires efficient ETL techniques
enabling continuous data integration, the focus of this paper.
Adopting real-time data warehousing requires coping with at least two radical changes in how data is handled. First, continuous data update actions must be performed, due to continuous data integration, and these should mostly consist of row insertions. Second, these update actions must be performed in parallel with the execution of OLAP queries, which – due to the new real-time nature of the data – will probably be issued more often. Therefore, the main contributions of this paper are threefold:
• Maximizing the freshness of data by efficiently and
rapidly integrating most recent OLTP data into the data
warehouse;
• Minimizing OLAP response time while simultaneously
performing continuous data integration;
• Maximizing the data warehouse’s availability by reducing
its update time window, in which users and OLAP
applications are off-line.
3. RELATED WORK
So far, research has mostly dealt with the problem of
maintaining
the warehouse in its traditional periodical update setup [14, 27].
Related literature presents tools and algorithms to populate the
warehouse in an off-line fashion. In a different line of research,
data streams [1, 2, 15, 20] could possibly appear as a potential
solution. However, research in data streams has focused on
topics
concerning the front-end, such as on-the-fly computation of
queries without a systematic treatment of the issues raised at the
back-end of a data warehouse [10]. Much of the recent work dedicated to RTDW is also focused on conceptual ETL modelling [4, 5, 19, 23], and lacks concrete extraction, transformation and loading algorithms along with an analysis of their OLTP and OLAP performance implications.
Temporal data warehouses address the issue of supporting
temporal information efficiently in data warehousing systems
[25]. In [27], the authors present efficient techniques (e.g.
temporal view self-maintenance) for maintaining data
warehouses
without disturbing source operations. A related challenge is
supporting large-scale temporal aggregation operations in data
warehouses [26]. In [4], the authors describe an approach which clearly separates the DW refreshment process from its traditional handling as a view maintenance or bulk loading process. They provide a conceptual model of the process, which is treated as a composite workflow, but they do not describe how to efficiently propagate the data. Theodoratus et al. discuss in [21] data currency quality factors in data warehouses and propose a DW design that considers these factors.
An important issue for near real-time data integration is the
accommodation of delays, which has been investigated for
(business) transactions in temporal active databases [18]. The
conclusion is that temporal faithfulness for transactions has to
be
provided, which preserves the serialization order of a set of
business transactions. Although possibly lagging behind real-
time,
a system that behaves in a temporally faithful manner
guarantees
the expected serialization order.
In [23], the authors describe the ARKTOS ETL tool, capable of
modeling and executing practical ETL scenarios by providing
explicit primitives for capturing common tasks (such as data
cleaning, scheduling and data transformations) using a
declarative
language. ARKTOS offers graphical and declarative features for
defining DW transformations and tries to optimize the execution
of complex sequences of transformation and cleansing tasks.
In [13], a zero-delay DW built with Gong is described, which helps provide confidence in the data available to every branch of the organization. Gong is a Tecco product [3] that offers uni- or bi-directional replication of data between homogeneous and/or heterogeneous distributed databases. Gong's database replication enables zero-delay business in order to assist in the daily running and decision making of the organization.
However, not all transactional information needs to be dealt with immediately to meet real-time decision making requirements. We can define which groups of data are more important to include rapidly in the data warehouse and which groups can be updated later. Recently, in [9], the authors presented an interesting architecture for defining update types and time priorities (immediate, at specific time intervals, or only during data warehouse offline updates) and the respective synchronization for each group of transactional data items.
Our methodology overcomes some of the mentioned drawbacks
and is presented in the next section.
4. CONTINUOUS DATA WAREHOUSE
LOADING METHODOLOGY
From the DW side, updating huge tables and related structures
(such as indexes, materialized views and other integrated
components) makes executing OLAP query workloads
simultaneously with continuous data integration a very difficult
task. Our methodology minimizes the processing time and
workload required for these update processes. It also facilitates
the DW off-line update (see section 4.4), because the data
already
lies within the data area and all OLTP data extraction and/or
transformation routines have been executed during the
continuous
data integration. Furthermore, the data structure of the
replicated
tables is exactly the same as in the original DW schema. This minimizes the time window for packing the data area of the DW, since its update becomes a one-step process amounting to a cut-and-paste action from the temporary tables to the original ones, as we shall demonstrate further.
Our methodology is focused on four major areas: (1) data warehouse schema adaptation; (2) ETL loading procedures; (3) OLAP query adaptation; and (4) DW database packing and reoptimization. It is mainly based on a very simple principle: new row insertions in tables with few (or no) contents are performed much faster than in large tables. Data handling in small tables is much less complex and much faster than in large ones; as a matter of fact, this is the main reason why OLTP data sources are kept with the fewest records necessary, in order to maximize their availability. The proposed continuous data warehouse loading methodology is presented in Figure 1.
Figure 1. General architecture of the proposed continuous data warehouse loading methodology.
4.1 Adapting the Data Warehouse Schema
Suppose a very simple sales data warehouse with the schema illustrated in Figure 2, having two dimension tables (Store and Customer, representing business descriptor entities) and one fact table (Sales, storing business measures aggregated from transactions). To simplify the figure, the Date dimension is not shown. This DW allows storing the sales value per store, per customer, per day. The primary keys are represented in bold, while the referential integrity constraints with foreign keys are represented in italic. The factual attribute S_Value is additive. This property of facts is very important for our methodology, as we shall demonstrate further on.
Figure 2. Sample sales data warehouse schema.
For the area concerning data warehouse schema adaptation, we
adopt the following method:
Data warehouse schema adaptation method for
supporting real-time data warehousing: Creation of an
exact structural replica of all the tables of the data
warehouse that could eventually receive new data. These
tables (referred also as temporary tables) are to be created
empty of contents, with no defined indexes, primary key, or
constraints of any kind, including referential integrity. For
each table, an extra attribute must be created, for storing a
unique sequential identifier related to the insertion of each
row within the temporary tables.
The modified schema for supporting RTDW based on our methodology is illustrated in Figure 3. The unique sequential identifier attribute present in each temporary table records the sequence in which each row is appended to the data area. This allows identifying the exact order of arrival of each newly inserted row, which is useful for restoring prior data states in disaster recovery procedures and also for discarding dimensional rows that have been superseded by more recent updates. For instance, if the same customer has had two updates in the OLTP systems which, consequently, lead to the insertion of two new rows in the temporary table CustomerTmp, only the most recent one is relevant. This can be determined by considering as most recent the row with the highest CTmp_Counter for that same customer (CTmp_CustKey).
Figure 3. Sample sales data warehouse schema modified for supporting real-time data warehousing. The temporary replica tables and their attributes are:
CustomerTmp (CTmp_CustKey, CTmp_Name, CTmp_Address, CTmp_PostalCode, CTmp_Phone, CTmp_EMail, CTmp_Counter)
SalesTmp (STmp_StoreKey, STmp_CustomerKey, STmp_Date, STmp_Value, STmp_Counter)
StoreTmp (StTmp_StoreKey, StTmp_Description, StTmp_Address, StTmp_PostalCode, StTmp_Phone, StTmp_EMail, StTmp_Manager, StTmp_Counter)
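For illustration purposes, the temporary fact replica and its insertion counter could be created with plain SQL such as the following (a minimal Oracle-style sketch assuming the sample schema of Figures 2 and 3; the data types and the sequence name are our own assumptions):

CREATE TABLE SalesTmp (
  STmp_StoreKey     NUMBER,
  STmp_CustomerKey  NUMBER,
  STmp_Date         DATE,
  STmp_Value        NUMBER,
  STmp_Counter      NUMBER
);
-- No primary key, indexes or referential integrity are defined on purpose.

-- Sequence used by the ETL loader to fill the unique sequential identifier.
CREATE SEQUENCE SalesTmp_Seq START WITH 1 INCREMENT BY 1;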
The authors of the ARKTOS tool [23] report that their own experience, as well as the recent literature, suggests that the main problems of ETL tools are not only performance problems (as would normally be expected), but also aspects such as complexity, practicality and price. By performing only the record insertions inherent to continuous data integration, against empty or small tables without any kind of constraint or attached physical structure, we guarantee the simplest and fastest logical and physical support for achieving our goals [12].
Since the only significant change to the logical and physical structure of the data warehouse's schema is the simple adaptation shown in Figure 3, the necessary ETL procedures can be implemented in a way that maximizes their operability. Data loading may be done with simple standard SQL instructions or with DBMS batch loading software such as SQL*Loader [16], with a minimum of complexity. There is no need to develop complex routines for updating the data area, in which the needed data is easily accessible, independently of the ETL tools used.
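For example, a bulk append into the temporary fact replica with SQL*Loader could use a control file along the following lines (a sketch only; the data file name, delimiter and date format are illustrative):

LOAD DATA
INFILE 'sales_feed.dat'
APPEND
INTO TABLE SalesTmp
FIELDS TERMINATED BY ';'
(STmp_StoreKey,
 STmp_CustomerKey,
 STmp_Date DATE "YYYY-MM-DD",
 STmp_Value,
 STmp_Counter)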
4.2 ETL Loading Procedures
To refresh the data warehouse, once the ETL application has extracted and transformed the OLTP data into the correct format for loading the data area of the DW, it should immediately insert that record as a new row in the corresponding temporary table, filling the unique sequential identifier attribute with the auto-incremented sequential number. This number should start at 1 for the first record inserted in the DW after executing the packing and reoptimizing technique (explained in section 4.4 of this paper), and then be incremented by one unit for each record insertion. The algorithm for accomplishing continuous data integration by the ETL tool may be similar to:
Trigger for each new record in the OLTP system (after it is committed)
    Extract the new record from the OLTP system
    Clean and transform the OLTP data, shaping it into the data warehouse destination table's format
    Increment the record insertion unique counter
    Create a new record in the data warehouse temporary destination table
    Insert the data into the temporary destination table's new record, along with the value of the record insertion unique counter
End_Trigger
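In a relational implementation, the final steps of this algorithm reduce to a single append per record, for instance (assuming the SalesTmp replica and the SalesTmp_Seq sequence sketched in section 4.1; the bind variables stand for the cleaned and transformed OLTP values):

INSERT INTO SalesTmp
  (STmp_StoreKey, STmp_CustomerKey, STmp_Date, STmp_Value, STmp_Counter)
VALUES
  (:store_key, :customer_key, :sale_date, :sale_value, SalesTmp_Seq.NEXTVAL);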
Next, we demonstrate a practical example of updating the data warehouse shown in Figure 3. Figure 4 presents the insertion of a row in the temporary fact table recording a sales transaction of value 100 which took place on 2008-05-02 in the store with St_StoreKey = 1, related to the customer with C_CustKey = 10, and identified by STmp_Counter = 1001. Meanwhile, other transactions occurred, and the organization's OLTP system registered that the value of the mentioned transaction should be 1000 instead of 100. The rows in the temporary fact table with STmp_Counter = 1011 and STmp_Counter = 1012 reflect this modification: the first cancels out the value of the initial transactional row and the second holds the new real value, exploiting the additivity of the STmp_Value attribute.
The definition of which attributes are additive and which are
not
should be the responsibility of the Database Administrator.
According to [11], the most useful facts in a data warehouse are
numeric and additive.
The loading method therefore uses the simplest way of writing data: appending new records. Any other type of writing operation requires more time-consuming and complex tasks.
STmp_StoreKey  STmp_CustomerKey  STmp_Date    STmp_Value  STmp_Counter
1              10                2008-05-02      100        1001
1              10                2008-05-02     -100        1011
1              10                2008-05-02     1000        1012
Figure 4. Partial contents of the temporary fact table SalesTmp with exemplification of record insertions.
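The correction shown in Figure 4 is obtained with two plain appends, for example (illustrative statements over the SalesTmp replica; only the additive STmp_Value changes sign in the compensating row):

-- Cancel the initially recorded value of 100 (counter 1011)...
INSERT INTO SalesTmp VALUES (1, 10, DATE '2008-05-02', -100, 1011);
-- ...and append the corrected value of 1000 (counter 1012).
INSERT INTO SalesTmp VALUES (1, 10, DATE '2008-05-02', 1000, 1012);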
4.3 OLAP Query Adaptation
Consider the following OLAP query, for calculating the total
revenue per store in the last seven days.
SELECT S_StoreKey,
Sum(S_Value) AS Last7DaysSaleVal
FROM Sales
WHERE S_Date>=SystemDate()-7
GROUP BY S_StoreKey
To take advantage of our schema modification method and include the most recent data in the OLAP query response, queries should be rewritten according to the following rule: the FROM clause should gather all relevant rows from both the required original tables and their temporary replicas, applying the fixed restriction predicates of the original WHERE clause inside each branch whenever possible. The modification of the prior instruction according to our methodology is illustrated below. The relevant rows from both tables are combined to supply the OLAP query answer, and the rows used in the resulting dataset are filtered according to the restrictions of the original instruction.
SELECT S_StoreKey,
       Sum(S_Value) AS Last7DaysSaleVal
FROM (SELECT S_StoreKey, S_Value
      FROM Sales
      WHERE S_Date>=SystemDate()-7
      UNION ALL
      SELECT STmp_StoreKey, STmp_Value
      FROM SalesTmp
      WHERE STmp_Date>=SystemDate()-7)
GROUP BY S_StoreKey
Note that the column names exposed by the UNION ALL are those of its first branch, so the outer references to S_StoreKey and S_Value remain valid.
An interesting and relevant aspect of the proposed methodology is that if OLAP users wish to query only the most recent information, they only need to query the temporary replicated tables. For instance, if the temporary tables are meant to be filled with the data of each business day before they are recreated, and we want to know the sales value of the current day, per store, the adequate response could be obtained with the following SQL instruction:
SELECT STmp_StoreKey,
Sum(STmp_Value) AS TodaysValue
FROM SalesTmp
WHERE STmp_Date=SystemDate()
GROUP BY STmp_StoreKey
This way, our method aids the processing of the data warehouse's most recent data, since this data is stored within the temporary replica tables, which are presumed to be small. This minimizes the CPU, memory and I/O costs involved in querying the most recent data. Theoretically, this would make it possible to deliver the most recent decision making information while the business transaction itself is occurring.
4.4 Packing and Reoptimizing the Data
Warehouse
Since the data is being integrated into tables that have no access optimization of any kind that could speed up querying (such as indexes), their functionality is obviously affected, implying a decrease in performance. As the physically occupied space grows, after a certain number of insertions the performance becomes too poor to be considered acceptable. To regain performance in the DW it is necessary to execute a pack routine which updates the original DW schema tables using the records in the temporary tables and recreates these temporary tables empty of contents, along with rebuilding the original tables' indexes and materialized views, so that maximum processing speed is obtained once more.
To update the original DW tables, the rows in the temporary tables should be aggregated according to the original tables' primary keys, keeping, for possible duplicates in non-additive attributes, the rows with the highest unique counter value, for they represent the most recent records. The time needed to execute these update procedures is the only period in which the DW is unavailable to OLAP tools and end users, for they need to be executed exclusively. The appropriate moment for doing this may be determined by the Database Administrator, or automatically, taking into consideration parameters such as a given number of records in the temporary tables, the amount of physically occupied space, or a predefined period of time. Determining this moment is a matter of finding the best possible compromise between the frequency of execution and the amount of time taken away from user availability; this depends on the physical, logical and practical characteristics of each specific DW implementation and is not discussed in this paper.
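A possible sketch of such a pack routine, in plain SQL over the sample schema of Figure 3, is shown below. It is not the definitive implementation: the column names of the original Sales and Customer tables are assumed to mirror their *Tmp counterparts, and a real routine would cover every temporary table and rebuild whatever indexes and materialized views exist.

-- 1. Fold the additive fact rows into the original fact table.
MERGE INTO Sales s
USING (SELECT STmp_StoreKey, STmp_CustomerKey, STmp_Date,
              SUM(STmp_Value) AS Delta_Value
       FROM   SalesTmp
       GROUP BY STmp_StoreKey, STmp_CustomerKey, STmp_Date) t
ON (s.S_StoreKey = t.STmp_StoreKey AND
    s.S_CustomerKey = t.STmp_CustomerKey AND
    s.S_Date = t.STmp_Date)
WHEN MATCHED THEN
  UPDATE SET s.S_Value = s.S_Value + t.Delta_Value
WHEN NOT MATCHED THEN
  INSERT (S_StoreKey, S_CustomerKey, S_Date, S_Value)
  VALUES (t.STmp_StoreKey, t.STmp_CustomerKey, t.STmp_Date, t.Delta_Value);

-- 2. For each dimension, keep only the most recent row per business key.
MERGE INTO Customer c
USING (SELECT CTmp_CustKey, CTmp_Name, CTmp_Address, CTmp_PostalCode,
              CTmp_Phone, CTmp_EMail
       FROM   CustomerTmp t
       WHERE  t.CTmp_Counter = (SELECT MAX(t2.CTmp_Counter)
                                FROM   CustomerTmp t2
                                WHERE  t2.CTmp_CustKey = t.CTmp_CustKey)) n
ON (c.C_CustKey = n.CTmp_CustKey)
WHEN MATCHED THEN
  UPDATE SET c.C_Name = n.CTmp_Name, c.C_Address = n.CTmp_Address,
             c.C_PostalCode = n.CTmp_PostalCode, c.C_Phone = n.CTmp_Phone,
             c.C_EMail = n.CTmp_EMail
WHEN NOT MATCHED THEN
  INSERT (C_CustKey, C_Name, C_Address, C_PostalCode, C_Phone, C_EMail)
  VALUES (n.CTmp_CustKey, n.CTmp_Name, n.CTmp_Address, n.CTmp_PostalCode,
          n.CTmp_Phone, n.CTmp_EMail);

-- 3. Recreate the temporary tables empty of contents and reoptimize.
TRUNCATE TABLE SalesTmp;
TRUNCATE TABLE CustomerTmp;
-- (rebuild the original tables' indexes and refresh materialized views here)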
4.5 Final Remarks on Our Methodology
Notice that only record insertions are used to update the DW for all related transactions in the OLTP source systems. Since this type of procedure does not require any record locking in the tables (except for the appended record itself) nor searching for previously stored data before writing (as update or delete instructions do), the time necessary to accomplish it is minimal. This is reinforced by the fact that the referred tables have no indexes or primary keys, implying absolutely no record locking except for the appended record itself. Furthermore, since they have no constraints of any sort, including referential integrity and primary keys, there is no need to execute time-consuming tasks such as index updating or referential integrity cross checks. Kimball notes in [12] that many ETL tools use an UPDATE ELSE INSERT function when loading data, and considers this a performance killer. With our method, any appending, updating or deleting of data in the OLTP systems is reflected only as new record insertions in the data warehouse, which minimizes row, block and table locks and other concurrent data access problems. Physical database tablespace fragmentation is also avoided, since there is no deletion of data, only sequential increments. This allows us to state that the data update time window of our method is minimal for the insertion of new data, maximizing the availability of that data and consequently contributing to effectively increase the data warehouse's global availability and minimize any negative impact on its performance.
As mentioned earlier, the extracted data usually needs to be corrected and transformed before updating the data warehouse's data area. Since we intend to obtain this data in near real time, the time gap between the recording of OLTP transactions and their extraction by ETL processes is minimal, with both occurring nearly at the same time, which somewhat reduces the probability of error. We can also assume that the number of intermediate "information buckets" the data passes through in the ETL area is minimal, for temporary storage is almost unnecessary. Furthermore, instead of extracting a considerable amount of OLTP data, as happens in "traditional" bulk-loading data warehousing, the volume of information extracted and transformed in real time is extremely small (commonly a few dozen bytes), since it consists of only one transaction per execution cycle. All this allows assuming that the extraction and transformation phase will be cleaner and more time efficient.
As a limitation, our methodology may not be applicable in data warehouse contexts where additive attributes are difficult or even impossible to define for the fact tables.
5. EXPERIMENTAL EVALUATION
Using the TPC-H decision support benchmark [22], we tested our methodology by creating 5GB, 10GB and 20GB DWs and performing continuous data integration at several time rates, on the Oracle 10g DBMS [16]. We used an Intel Celeron 2.8GHz machine with 512MB, 1GB or 2GB of SDRAM and a 7200rpm 160GB hard disk. The TPC-H schema modified according to our methodology can be seen in Figure 5. Tables Region and Nation are not replicated as temporary tables because they are fixed-size tables in the TPC-H benchmark and therefore do not receive new data. For the query workloads we selected TPC-H queries 1, 8, 12 and 20, executed in random order by each simultaneous user, for up to 8 hours of testing in each scenario. The summarized results presented in this section correspond to a total of approximately 4000 hours of testing.
The number of transactions, average time interval between transactions, number of records and data size to be inserted for each scenario are shown in Table 1. Each new transaction represents the insertion of an average of four records in LineItemTmp and one row in each of the other temporary tables, continuously integrated over a period of 8 hours. For each database size, we also tested three different transaction insertion time frequencies, referenced as 1D, 2D and 4D, each corresponding to a different number of records inserted throughout the tests.
The impact on OLAP processing while performing continuous data integration is shown in Figures 6 and 7. Table 2 summarizes the results for each scenario. All results show that the system is strongly dependent on the transaction rate, performing better when more RAM is used. They show a minimum cost of around 8% in query response time for practicing our methodology, independently of the amount of RAM, for a transactional frequency of one transaction every 9,38 seconds. Apparently, this is the lowest price to pay for having real-time data warehousing with our methodology, for the tested benchmark. At the highest transactional frequency, one transaction every 0,57 seconds, the increase in OLAP response time climbs up to a maximum of 38,5% with 512MB of RAM. The results also show that the methodology is relatively scalable with respect to database size.
Scenario      Nr. transactions      Average time between    Nr. records to          Total size of RTDW
              to load in 8 hours    transactions            integrate in 8 hours    transactions
DW5GB-1D       3.072                 9,38 seconds             54.040                 4,59 MBytes
DW5GB-2D       6.205                 4,64 seconds            106.302                 9,28 MBytes
DW5GB-4D      12.453                 2,31 seconds            204.904                18,62 MBytes
DW10GB-1D      6.192                 4,65 seconds            109.464                 9,26 MBytes
DW10GB-2D     12.592                 2,29 seconds            213.912                18,83 MBytes
DW10GB-4D     25.067                 1,15 seconds            410.764                37,48 MBytes
DW20GB-1D     12.416                 2,32 seconds            218.634                18,57 MBytes
DW20GB-2D     25.062                 1,15 seconds            429.214                37,48 MBytes
DW20GB-4D     50.237                 0,57 seconds            825.242                75,12 MBytes
Table 1. Characterization of the experimental evaluation scenarios.
Figure 5. TPC-H data warehouse schema modified for supporting real-time data warehousing.
Figure 6. Average increase of the OLAP response time (data warehouse dimension – RAM memory); the per-scenario values plotted are those listed in Table 2.
Figure 7. Average increase of the OLAP response time (%) for the query workload per transactional frequency:
               9,38 sec   4,65 sec   2,32 sec   1,15 sec   0,57 sec
512MB RAM        8,8        14,6       21,1       29,9       38,5
1GB RAM          8,3        12,8       16,2       22,8       30,4
2GB RAM          8,5        11,5       14,0       18,7       25,9
Time between     Data Warehouse 5 GBytes           Data Warehouse 10 GBytes          Data Warehouse 20 GBytes
transactions     512 MBytes  1 GByte  2 GBytes     512 MBytes  1 GByte  2 GBytes     512 MBytes  1 GByte  2 GBytes
9,38 seconds       8,8%       8,3%     8,5%          nt          nt       nt            nt         nt       nt
4,65 seconds      15,7%      14,3%    13,3%         13,4%       11,2%     9,7%          nt         nt       nt
2,32 seconds      23,0%      20,4%    17,4%         22,6%       18,0%    15,4%         17,7%      10,1%     9,1%
1,15 seconds        nt         nt       nt          31,9%       25,3%    20,1%         27,8%      20,2%    17,2%
0,57 seconds        nt         nt       nt            nt          nt       nt          38,5%      30,4%    25,9%
nt – not tested
Table 2. Characterization of the average increase of OLAP response times for each experimental scenario.
6. CONCLUSIONS AND FUTURE WORK
This paper describes the necessary requirements for RTDW and presents a methodology that supports the implementation of RTDW by enabling continuous data integration while minimizing the impact on query execution at the user end of the DW. This is achieved through data structure replication and by adapting query instructions to take advantage of the new real-time data warehousing schemas.
We have shown its functionality through a simulation using the TPC-H benchmark, performing continuous data integration at various time rates while executing several simultaneous query workloads, for data warehouses of different sizes. All scenarios show that it is possible to achieve real-time data warehousing in exchange for an average increase in query execution time. This should be considered the price to pay for real-time capability within the data warehouse.
As future work we intend to develop an ETL tool that integrates this methodology with extraction and transformation routines for the OLTP systems. There is also room for optimizing the query instructions used by our methods.
7. REFERENCES
[1] D. J. Abadi, D. Carney, et al., 2003. “Aurora: A New Model
and Architecture for Data Stream Management”, The VLDB
Journal, 12(2), pp. 120-139.
[2] S. Babu, and J. Widom, 2001. “Continuous Queries Over
Data Streams”, SIGMOD Record 30(3), pp. 109-120.
[3] T. Binder, 2003. Gong User Manual, Tecco Software
Entwicklung AG.
[4] M. Bouzeghoub, F. Fabret, and M. Matulovic, 1999.
“Modeling Data Warehouse Refreshment Process as a
Workflow Application”, Intern. Workshop on Design and
Management of Data Warehouses (DMDW).
[5] R. M. Bruckner, B. List, and J. Schiefer, 2002. “Striving
Towards Near Real-Time Data Integration for Data
Warehouses”, International Conference on Data Warehousing
and Knowledge Discovery (DAWAK).
[6] R. M. Bruckner, and A. M. Tjoa, 2002. “Capturing Delays
and Valid Times in Data Warehouses – Towards Timely
Consistent Analyses”. Journal of Intelligent Information
Systems (JIIS), 19:2, pp. 169-190.
[7] S. Chaudhuri, and U. Dayal, 1997. “An Overview of Data
Warehousing and OLAP Technology”, SIGMOD Record,
Volume 26, Number 1, pp. 65-74.
[8] W. H. Inmon, R. H. Terdeman, J. Norris-Montanari, and D.
Meers, 2001. Data Warehousing for E-Business, J. Wiley &
Sons.
[9] I. C. Italiano, and J. E. Ferreira, 2006. “Synchronization
Options for Data Warehouse Designs”, IEEE Computer
Magazine.
[10] A. Karakasidis, P. Vassiliadis, and E. Pitoura, 2005. “ETL
Queues for Active Data Warehousing”, IQIS’05.
[11] R. Kimball, L. Reeves, M. Ross, and W. Thornthwaite,
1998.
The Data Warehouse Lifecycle Toolkit – Expert Methods for
Designing, Developing and Deploying Data Warehouses,
Wiley Computer Publishing.
[12] R. Kimball, and J. Caserta, 2004. The Data Warehouse ETL
Toolkit, Wiley Computer Publishing.
[13] E. Kuhn, 2003. “The Zero-Delay Data Warehouse:
Mobilizing Heterogeneous Databases”, International
Conference on Very Large Data Bases (VLDB).
[14] W. Labio, J. Yang, Y. Cui, H. Garcia-Molina, and J.
Widom,
2000. “Performance Issues in Incremental Warehouse
Maintenance”, International Conference on Very Large Data
Bases (VLDB).
[15] D. Lomet, and J. Gehrke, 2003. Special Issue on Data
Stream Processing, IEEE Data Eng. Bulletin, 26(1).
[16] Oracle Corporation, 2005. www.oracle.com
[17] T. B. Pedersen, 2004. “How is BI Used in Industry?”, Int.
Conf. on Data Warehousing and Knowledge Discovery
(DAWAK).
[18] J. F. Roddick, and M. Schrefl, 2000. “Towards an
Accommodation of Delay in Temporal Active Databases”,
11th ADC.
[19] A. Simitsis, P. Vassiliadis and T. Sellis, 2005. "Optimizing ETL Processes in Data Warehouses", Int. Conference on Data Engineering (ICDE).
[20] U. Srivastava, and J. Widom, 2004. “Flexible Time
Management in Data Stream Systems”, PODS.
[21] D. Theodoratus, and M. Bouzeghoub, 1999. “Data Currency
Quality Factors in Data Warehouse Design”, International
Workshop on the Design and Management of Data
Warehouses (DMDW).
[22] TPC-H decision support benchmark, Transaction Processing Performance Council, www.tpc.com.
[23] P. Vassiliadis, Z. Vagena, S. Skiadopoulos, N.
Karayannidis,
and T. Sellis, 2001. “ARKTOS: Towards the Modelling,
Design, Control and Execution of ETL Processes”,
Information Systems, Vol. 26(8).
[24] C. White, 2002. "Intelligent Business Strategies: Real-Time Data Warehousing Heats Up", DM Review, www.dmreview.com/article_sub_cfm?articleId=5570.
[25] J. Yang, 2001. “Temporal Data Warehousing”, Ph.D.
Thesis, Dpt. Computer Science, Stanford University.
[26] J. Yang, and J. Widom, 2001. “Incremental Computation
and Maintenance of Temporal Aggregates”, 17th Intern.
Conference on Data Engineering (ICDE).
[27] J. Yang, and J. Widom, 2001. “Temporal View Self-
Maintenance”, 7th Int. Conf. Extending Database Technology
(EDBT).
[28] T. Zurek, and K. Kreplin, 2001. “SAP Business
Information
Warehouse – From Data Warehousing to an E-Business
Platform”, 17th International Conference on Data Engineering
(ICDE).
An Oracle White Paper
August 2012
Best Practices for Real-time Data Warehousing
Executive Overview
Today’s integration project teams face the daunting challenge
that, while data volumes are
exponentially growing, the need for timely and accurate
business intelligence is also
constantly increasing. Batches for data warehouse loads used to
be scheduled daily to weekly;
today’s businesses demand information that is as fresh as possible. Because the value of this real-time business data decreases as it gets older, low latency of data integration is essential to the business value of the data warehouse. At the same time the
concept of “business hours” is
vanishing for a global enterprise, as data warehouses are in use
24 hours a day, 365 days a
year. This means that the traditional nightly batch windows are
becoming harder to
accommodate, and interrupting or slowing down sources is not
acceptable at any time during
the day. Finally, integration projects have to be completed in
shorter release timeframes,
while fully meeting functional, performance, and quality
specifications on time and within
budget. These processes must be maintainable over time, and
the completed work should be
reusable for further, more cohesive, integration initiatives.
Conventional “Extract, Transform, Load” (ETL) tools closely
intermix data transformation
rules with integration process procedures, requiring the
development of both data
transformations and data flow. Oracle Data Integrator (ODI)
takes a different approach to
integration by clearly separating the declarative rules (the
“what”) from the actual
implementation (the “how”). With ODI, declarative rules
describing mappings and
transformations are defined graphically, through a drag-and-
drop interface, and stored
independently from the implementation. ODI automatically
generates the data flow, which
can be fine-tuned if required. This innovative approach for
declarative design has also been
applied to ODI's framework for Changed Data Capture (CDC).
ODI’s CDC moves only
changed data to the target systems and can be integrated with
Oracle GoldenGate, thereby
enabling the kind of real time integration that businesses
require.
This technical brief describes several techniques available in
ODI to adjust data latency from
scheduled batches to continuous real-time integration.
Introduction
The conventional approach to data integration involves
extracting all data from the source
system and then integrating the entire set—possibly using an
incremental strategy—in the
target system. This approach, which is suitable in most cases,
can be inefficient when the
integration process requires real-time data integration. In such
situations, the amount of data
involved makes data integration impossible in the given
timeframes.
Basic solutions, such as filtering records according to a
timestamp column or “changed” flag,
are possible, but they might require modifications in the
applications. In addition, they
usually do not sufficiently ensure that all changes are taken into
account.
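For instance, a pull-based incremental extract is often just a filtered query such as the following (an illustrative sketch; the ORDERS table, its LAST_MODIFIED column and the bind variable are assumptions, not ODI artifacts):

-- Pull only the rows changed since the last successful extraction.
SELECT *
FROM   ORDERS
WHERE  LAST_MODIFIED > :last_extract_timestamp;
-- Deleted rows, and updates that bypass the LAST_MODIFIED column,
-- are missed by this approach.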
ODI’s Changed Data Capture identifies and captures data as it
is being inserted, updated, or
deleted from datastores, and it makes the changed data available
for integration processes.
Real-Time Data Integration Use Cases
Integration teams require real-time data integration with low or
no data latency for a number
of use cases. While this whitepaper focuses on data
warehousing, it is useful to differentiate
the following areas:
- Real-time data warehousing
Aggregation of analytical data in a data warehouse using
continuous or near real-
time loads.
- Operational reporting and dashboards
Selection of operational data into a reporting database for BI
tools and dashboards.
- Query Offloading
Replication of high-cost or legacy OLTP servers to secondary
systems to ease query
load.
- High Availability / Disaster Recovery
Duplication of database systems in active-active or active-
passive scenarios to
improve availability during outages.
- Zero Downtime Migrations
Ability to synchronize data between old and new systems with
potentially different
technologies to allow for switch-over and switch-back without
downtime.
- Data Federation / Data Services
Provide virtual, canonical views of data distributed over several
systems through
federated queries over heterogeneous sources.
Oracle has various solutions for different real-time data
integration use cases. Query
offloading, high availability/disaster recovery, and zero-
downtime migrations can be handled
through the Oracle GoldenGate product that provides
heterogeneous, non-intrusive and
highly performant changed data capture, routing, and delivery.
In order to provide no to low
latency loads, ODI has various alternatives for real-time data
warehousing through the use of
CDC mechanism, including the integration with Oracle
GoldenGate. This integration also
provides seamless operational reporting. Data federation and
data service use cases are
covered by Oracle Data Service Integrator (ODSI).
Architectures for Loading Data Warehouses
Various architectures for collecting transactional data from
operational sources have been
used to populate data warehouses. These techniques vary mostly
on the latency of data
integration, from daily batches to continuous real-time
integration. The capture of data from
sources is either performed through incremental queries that
filter based on a timestamp or
flag, or through a CDC mechanism that detects any changes as it
is happening. Architectures
are further distinguished between pull and push operation,
where a pull operation polls in
fixed intervals for new data, while in a push operation data is
loaded into the target once a
change appears.
A daily batch mechanism is most suitable if intra-day freshness
is not required for the data,
such as longer-term trends or data that is only calculated once
daily, for example financial
close information. Batch loads might be performed in a
downtime window, if the business
model doesn’t require 24 hour availability of the data
warehouse. Different techniques such
as real-time partitioning or trickle-and-flip1 exist to minimize
the impact of a load to a live
data warehouse without downtime.
The four loading architectures can be summarized as follows:
- Batch: data is loaded in full or incrementally using an off-peak window. Latency: daily or higher. Capture: filter query. Initialization: pull. Target load: high impact. Source load: high impact.
- Mini-Batch: data is loaded incrementally using intra-day loads. Latency: hourly or higher. Capture: filter query. Initialization: pull. Target load: low impact, load frequency is tuneable. Source load: queries at peak times necessary.
- Micro-Batch: source changes are captured and accumulated to be loaded in intervals. Latency: 15 minutes or higher. Capture: CDC. Initialization: push, then pull. Target load: low impact, load frequency is tuneable. Source load: some to none, depending on CDC technique.
- Real-Time: source changes are captured and immediately applied to the DW. Latency: sub-second. Capture: CDC. Initialization: push. Target load: low impact, load frequency is tuneable. Source load: some to none, depending on CDC technique.
1 See also: Real-Time Data Warehousing: Challenges and Solutions, by Justin Langseth (http://dssresources.com/papers/features/langseth/langseth02082004.html).
IMPLEMENTING CDC WITH ODI
Change Data Capture as a concept is natively embedded in ODI.
It is controlled by the
modular Knowledge Module concept and supports different
methods of CDC. This chapter
describes the details and benefits of the ODI CDC feature.
Modular Framework for Different Load Mechanisms
ODI supports each of the described data warehouse load
architectures with its modular
Knowledge Module architecture. Knowledge Modules enable
integration designers to
separate the declarative rules of data mapping from their
technical implementation by
selecting a best practice mechanism for data integration. Batch
and Mini-Batch strategies can
be defined by selecting Load Knowledge Modules (LKM) for
the appropriate incremental
load from the sources. Micro-Batch and Real-Time strategies
use the Journalizing
Knowledge Modules (JKM) to select a CDC mechanism to
immediately access changes in
the data sources. Mapping logic can be left unchanged for
switching KM strategies, so that a
change in loading patterns and latency does not require a
rewrite of the integration logic.
Methods for Tracking Changes using CDC
ODI has abstracted the concept of CDC into a journalizing
framework with a JKM and
journalizing infrastructure at its core. By isolating the physical
specifics of the capture
process from the processing of detected changes, it is possible to
support a number of different
techniques that are represented by individual JKMs:
Non-invasive CDC through Oracle GoldenGate
Oracle GoldenGate provides a CDC mechanism that can process source changes non-invasively by processing log files of completed transactions and storing these captured changes in external Trail Files, independent of the database. Changes are then reliably transferred to a staging database. The JKM uses the metadata managed by ODI to generate all Oracle GoldenGate configuration files and processes all GoldenGate-detected changes in the staging area. These changes are then loaded into the target data warehouse using ODI's declarative transformation mappings. This architecture enables separate real-time reporting on the normalized staging area tables in addition to loading and transforming the data into the analytical data warehouse tables.
Figure 1: GoldenGate-based CDC (the source database log is captured by GoldenGate into the staging area and its J$ journal tables, which also serve real-time reporting; ODI then loads the changes into the target).
Database Log Facilities
Some databases provide APIs and utilities to process table
changes programmatically. For
example, Oracle provides the Streams interface to process log
entries and store them in
separate tables. Such log-based JKMs have better scalability
than trigger-based mechanisms,
but still require changes to the source database. ODI also
supports log-based CDC on DB2
for iSeries using its journals.
Database Triggers
JKMs based on database triggers define procedures that are
executed inside the source
database when a table change occurs. Based on the wide
availability of trigger mechanisms in
databases, JKMs based on triggers are available for a wide
range of sources such as Oracle,
IBM DB2 for iSeries and UDB, Informix, Microsoft SQL
Server, Sybase, and others. The
disadvantage is the limited scalability and performance of
trigger procedures, making them
optimal for use cases with light to medium loads.
Figure 2: Streams-based CDC on Oracle (the database log is processed by Oracle Streams into a J$ journal table on the source; ODI loads the changes into the target).
Figure 3: Trigger-based CDC (a trigger on the source table writes changes to a J$ journal table; ODI loads them into the target).
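To illustrate the trigger-based approach, a much simplified journalizing trigger might look as follows (a sketch only, not the code ODI actually generates; the ORDERS table, the J$ORDERS journal, its columns and the subscriber name are assumptions):

-- Minimal journal table holding the primary key of each changed row.
CREATE TABLE J$ORDERS (
  JRN_SUBSCRIBER  VARCHAR2(30),
  JRN_FLAG        CHAR(1),       -- 'I' for insert/update, 'D' for delete
  JRN_DATE        TIMESTAMP,
  ORDER_ID        NUMBER
);

CREATE OR REPLACE TRIGGER T$ORDERS
AFTER INSERT OR UPDATE OR DELETE ON ORDERS
FOR EACH ROW
BEGIN
  IF DELETING THEN
    INSERT INTO J$ORDERS VALUES ('DW_LOAD', 'D', SYSTIMESTAMP, :OLD.ORDER_ID);
  ELSE
    INSERT INTO J$ORDERS VALUES ('DW_LOAD', 'I', SYSTIMESTAMP, :NEW.ORDER_ID);
  END IF;
END;
/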
Oracle Changed Data Capture Adapters
Oracle also provides separate CDC adapters that cover legacy
platforms such as DB2 for
z/OS, VSAM CICS, VSAM Batch, IMS/DB and Adabas. These adapters provide high performance by capturing changes directly from the database logs.
Source databases supported for ODI CDC:
- Oracle: JKM Oracle GoldenGate; database log facilities; trigger-based CDC.
- MS SQL Server: JKM Oracle GoldenGate; Oracle CDC Adapters; trigger-based CDC.
- Sybase ASE: JKM Oracle GoldenGate; trigger-based CDC.
- DB2 UDB: JKM Oracle GoldenGate; trigger-based CDC.
- DB2 for iSeries: JKM Oracle GoldenGate (2); database log facilities (journals); trigger-based CDC.
- DB2 for z/OS: JKM Oracle GoldenGate (2); Oracle CDC Adapters.
- Informix, Hypersonic DB: trigger-based CDC.
- Teradata, Enscribe, MySQL, SQL/MP, SQL/MX: JKM Oracle GoldenGate (2).
- VSAM CICS, VSAM Batch, IMS/DB, Adabas: Oracle CDC Adapters.
(2) Requires customization of the Oracle GoldenGate configuration generated by the JKM.
Publish-and-Subscribe Model
The ODI journalizing framework uses a publish-and-subscribe
model. This model works in
three steps:
1. An identified subscriber, usually an integration process,
subscribes to changes that
might occur in a datastore. Multiple subscribers can subscribe
to these changes.
2. The Changed Data Capture framework captures changes in
the datastore and then
publishes them for the subscriber.
3. The subscriber—an integration process—can process the
tracked changes at any
time and consume these events. Once consumed, events are no
longer available for
this subscriber.
ODI processes datastore changes in two ways:
- Regularly in batches (pull mode)—for example, processes new
orders from the
Web site every five minutes and loads them into the operational
datastore (ODS)
- In real time (push mode) as the changes occur—for example, when a product is changed in the enterprise resource planning (ERP) system, the on-line catalog is immediately updated
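Conceptually, one pull-mode cycle for a single subscriber boils down to steps like the following (illustrative SQL against the hypothetical J$ORDERS journal sketched earlier, not ODI's generated code):

-- 1. Read the source rows whose changes are pending for this subscriber.
SELECT o.*
FROM   ORDERS o
JOIN   J$ORDERS j ON j.ORDER_ID = o.ORDER_ID
WHERE  j.JRN_SUBSCRIBER = 'DW_LOAD';

-- 2. Transform and load the result into the target (not shown);
--    deletions would be handled separately using the JRN_FLAG value.

-- 3. Consume the events so they are no longer visible to this subscriber.
DELETE FROM J$ORDERS
WHERE  JRN_SUBSCRIBER = 'DW_LOAD';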
Processing the Changes
ODI employs a powerful declarative design approach, Extract-
Load, Transform (E-LT),
which separates the rules from the implementation details. Its
out-of-the-box integration
interfaces use and process the tracked changes.
Developers define the declarative rules for the captured changes
within the integration processes
in the Designer Navigator of the ODI Studio graphical user
interface—without having to code.
In ODI Studio, customers declaratively specify set-based maps
between sources and targets, and
then the system automatically generates the data flow from the
set-based maps.
Figure 4: The ODI Journalizing Framework uses a publish-and-subscribe architecture (changes to the Orders datastore, e.g. order #5A32, are captured and published by the CDC framework; integration processes 1 and 2 subscribe to and consume these changes to feed targets 1 and 2).
The technical processes required for processing the changes
captured are implemented in
ODI’s Knowledge Modules. Knowledge Modules are scripted
modules that contain database
and application-specific patterns. The runtime then interprets
these modules and optimizes
the instructions for targets.
Ensuring Data Consistency
Changes frequently involve several datastores at one time. For
example, when an order is
created, updated, or deleted, it involves both the orders table
and the order lines table. When
processing a new order line, the new order to which this line is
related must be taken into
account.
ODI provides a mode of tracking changes, called Consistent Set
Changed Data Capture, for
this purpose. This mode allows you to process sets of changes
that guarantee data
consistency.
Best Practices using ODI for Real-Time Data Warehousing
There is no one-size-fits-all approach when it comes to real-time data warehousing. Much depends on the latency requirements, the overall data volume, the daily change volume, the load patterns on sources and targets, and the structure and query requirements of the data warehouse. As covered in this paper, ODI supports all approaches to loading a data warehouse.
In practice, one approach satisfies the majority of real-time data warehousing use cases: the micro-batch approach using GoldenGate-based CDC with ODI. In this approach, one or more tables from operational databases are used as sources and are replicated by GoldenGate into a staging area database. This staging area provides a real-time copy of the transactional data for real-time reporting using BI tools and dashboards. The operational sources are not additionally stressed, because GoldenGate capture is non-invasive and performant, and the separate staging area handles operational BI queries without adding load to the transactional system. ODI then loads the changed records into the real-time data warehouse at frequent intervals of 15 minutes or more. This pattern has proven to be the best combination for providing fresh, actionable data to the data warehouse without introducing inconsistencies in its calculated aggregates.
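The periodic ODI load in this pattern can be pictured as a merge from the GoldenGate-maintained staging copy into the warehouse table, run every 15 minutes or more. The statement below is a simplified, hand-written sketch; the table names, the change-timestamp column, and the :last_load_ts variable are assumptions, not objects created by ODI or GoldenGate.

-- Micro-batch: merge rows changed since the previous load into the warehouse table.
MERGE INTO dw_orders d
USING (SELECT order_id, customer_id, order_date, order_amount
         FROM stg_orders
        WHERE last_change_ts > :last_load_ts) s
   ON (d.order_id = s.order_id)
 WHEN MATCHED THEN UPDATE SET
      d.customer_id  = s.customer_id,
      d.order_date   = s.order_date,
      d.order_amount = s.order_amount
 WHEN NOT MATCHED THEN INSERT (order_id, customer_id, order_date, order_amount)
      VALUES (s.order_id, s.customer_id, s.order_date, s.order_amount);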
Customer Use Case: Overstock.com
Overstock.com is using both GoldenGate and ODI to load
customer transaction data into a
real-time data warehouse. GoldenGate is used to capture
changes from its retail site, while
ODI is used for complex transformations in its data warehouse.
Overstock.com has seen the benefits of leveraging customer
data in real time. When the
company sends out an e-mail campaign, rather than waiting one,
two, or three days, it
immediately sees whether consumers are clicking in the right
place, if the e-mail is driving
consumers to the site, and if those customers are making
purchases. By accessing the data in
real time using Oracle GoldenGate and transforming it using
Oracle Data Integrator,
Overstock.com can immediately track customer behavior,
profitability of campaigns, and
ROI. The retailer can now analyze customer behavior and purchase history to target marketing campaigns in real time and provide better service to its customers.
Conclusion
Integrating data and applications throughout the enterprise, and
presenting a consolidated
view of them, is a complex proposition. Not only are there
broad disparities in data
structures and application functionality, but there are also
fundamental differences in
integration architectures. Some integration needs are data
oriented, especially those involving
large data volumes. Other integration projects lend themselves
to an event-oriented
architecture for asynchronous or synchronous integration.
Changes tracked by Changed Data Capture constitute data
events. The ability to track these
events and process them regularly in batches or in real time is
key to the success of an event-
driven integration architecture. Oracle Data Integrator provides
rapid implementation and
maintenance for all types of integration projects.
Figure 5: Micro-Batch Architecture using ODI and GoldenGate (operational sources → GoldenGate log-based capture → staging area for operational BI / real-time reporting → periodic ODI load → real-time data warehouse).
Best Practices for Real-time Data Warehousing
August 2012
Author: Alex Kotopoulis
Oracle Corporation
World Headquarters
500 Oracle Parkway
Redwood Shores, CA 94065
U.S.A.
Worldwide Inquiries:
Phone: +1.650.506.7000
Fax: +1.650.506.7200
oracle.com
Copyright © 2010, Oracle and/or its affiliates. All rights
reserved. This document is provided for information purposes
only and
the contents hereof are subject to change without notice. This
document is not warranted to be error-free, nor subject to any
other
warranties or conditions, whether expressed orally or implied in
law, including implied warranties and conditions of
merchantability or
fitness for a particular purpose. We specifically disclaim any
liability with respect to this document and no contractual
obligations are
formed either directly or indirectly by this document. This
document may not be reproduced or transmitted in any form or
by any
means, electronic or mechanical, for any purpose, without our
prior written permission.
Oracle is a registered trademark of Oracle Corporation and/or
its affiliates. Other names may be trademarks of their respective
owners.
Real-Time Data Warehouse Loading Methodology Ricardo Jorge S.docx

  • 1. Real-Time Data Warehouse Loading Methodology Ricardo Jorge Santos CISUC – Centre of Informatics and Systems DEI – FCT – University of Coimbra Coimbra, Portugal [email protected] Jorge Bernardino CISUC, IPC – Polytechnic Institute of Coimbra ISEC – Superior Institute of Engineering of Coimbra Coimbra, Portugal [email protected] ABSTRACT A data warehouse provides information for analytical processing, decision making and data mining tools. As the concept of real- time enterprise evolves, the synchronism between transactional data and data warehouses, statically implemented, has been redefined. Traditional data warehouse systems have static structures of their schemas and relationships between data, and therefore are not able to support any dynamics in their structure and content. Their data is only periodically updated because they are not prepared for continuous data integration. For real-time enterprises with needs in decision support purposes, real-time data warehouses seem to be very promising. In this paper we present a
  • 2. methodology on how to adapt data warehouse schemas and user- end OLAP queries for efficiently supporting real-time data integration. To accomplish this, we use techniques such as table structure replication and query predicate restrictions for selecting data, to enable continuously loading data in the data warehouse with minimum impact in query execution time. We demonstrate the efficiency of the method by analyzing its impact in query performance using benchmark TPC-H executing query workloads while simultaneously performing continuous data integration at various insertion time rates. Keywords real-time and active data warehousing, continuous data integration for data warehousing, data warehouse refreshment loading process. 1. INTRODUCTION A data warehouse (DW) provides information for analytical processing, decision making and data mining tools. A DW collects data from multiple heterogeneous operational source systems (OLTP – On-Line Transaction Processing) and stores summarized integrated business data in a central repository used by analytical applications (OLAP – On-Line Analytical Processing) with different user requirements. The data area of a data warehouse usually stores the complete history of a business. The common process for obtaining decision making information is based on using OLAP tools [7]. These tools have their data source based on the DW data area, in which records are updated by ETL (Extraction, Transformation and Loading) tools. The ETL processes are responsible for identifying and extracting the relevant data from the OLTP source systems, customizing and
  • 3. integrating this data into a common format, cleaning the data and conforming it into an adequate integrated format for updating the data area of the DW and, finally, loading the final formatted data into its database. Traditionally, it has been well accepted that data warehouse databases are updated periodically – typically in a daily, weekly or even monthly basis [28] – implying that its data is never up- to- date, for OLTP records saved between those updates are not included the data area. This implies that the most recent operational records are not included into the data area, thus getting excluded from the results supplied by OLAP tools. Until recently, using periodically updated data was not a crucial issue. However, with enterprises such as e-business, stock brokering, online telecommunications, and health systems, for instance, relevant information needs to be delivered as fast as possible to knowledge workers or decision systems who rely on it to react in a near real-time manner, according to the new and most recent data captured by an organization’s information system [8]. This makes supporting near real-time data warehousing (RTDW) a critical issue for such applications. The demand for fresh data in data warehouses has always been a strong desideratum. Data warehouse refreshment (integration of new data) is traditionally performed in an off-line fashion. This means that while processes for updating the data area are executed, OLAP users and applications cannot access any data. This set of activities usually takes place in a preset loading time window, to avoid overloading the operational OLTP source systems with the extra workload of this workflow. Still, users
  • 4. are pushing for higher levels of freshness, since more and more enterprises operate in a business time schedule of 24x7. Active Data Warehousing refers to a new trend where DWs are updated as frequently as possible, due to the high demands of users for fresh data. The term is also designated as Real-Time Data Warehousing for that reason in [24]. The conclusions presented by T. B. Pedersen in a report from a knowledge exchange network formed by several major technological partners in Denmark [17] refer that all partners agree real-time enterprise and continuous data availability is considered a short term priority for many business and general data-based enterprises. Nowadays, IT managers are facing crucial challenges deciding whether to build a real-time data warehouse instead of a conventional one and whether their existing data warehouse is going out of style and needs to be converted into a real-time data warehouse to remain competitive. In some specific cases, data update delays larger than a few seconds or minutes may jeopardise the usefulness of the whole system. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. IDEAS’08, September 10–12, 2008, Coimbra, Portugal
  • 5. Editor: Bipin C. DESAI Copyright 2008 ACM 978-1-60158-188-0/08/09…$5.00 In a nutshell, accomplishing near zero latency between OLTP and OLAP systems consists in insuring continuous data integration from the former type of systems into the last. In order to make this feasible, several issues need to be taken under consideration: (1) Operational OLTP systems are designed to meet well- specified (short) response time requirements aiming for maximum system availability, which means that a RTDW scenario would have to cope with this in the overhead implied in those OLTP systems; (2) The tables existing in a data warehouse’s database directly related with transactional records (commonly named as fact tables) are usually huge in size, and therefore, the addition of new data and consequent procedures such as index updating or referential integrity checks would certainly have impact in OLAP systems’ performance and data availability. Our work is focused on the DW perspective, for that reason we present an efficient methodology for continuous data integration, performing the ETL loading process. This paper presents a solution which enables efficient continuous data integration in data warehouses, while allowing OLAP execution simultaneously, with minimum decrease of performance. With this, we seek to minimize the delay between the recording of transactional information and its reflected update
  • 6. on the decision support database. If the operational data extraction and transformation are able of performing without significant delay during the transaction itself, this solution will load all the decision support information needed in the data warehouse in useful time, allowing an answer while the transaction still is occurring. This is the general concept of real time data warehousing. The issues focused in this paper concern the DW end of the system, referring how to perform loading processes of ETL procedures and the DW’s data area usage for efficiently supporting continuous data integration. The items concerning extracting and transforming of operational (OLTP) source systems data are not the focus of this paper. Therefore, we shall always be referring the data warehouse point of view, in the remainder of the paper. Based on the existing schema(s) of the DW’s database, our methodology consists on creating a replica for each of its tables, empty of contents and without defining any type of indexes, primary keys or any other kind of restrictions or constraints. In this new schema, the replicated tables will receive and record the data from the staging area, which will be continuously loaded. The fact that they are initially created empty of contents and not having any sort of constraints allows to considerably minimize the consumption of time and resources which are necessary in the procedures inherent to data integration. As the data integration is performed, the database’s performance which includes the new most recent data deteriorates, due to the lack of usual data structures that could optimize it (such as indexes, for example), in the replicated tables. When the data warehouse’s performance is considered as unacceptable by its users or administrators, the
  • 7. existing data within the replicated tables should serve for updating the original schema. After performing this update, the replicated tables are to be recreated empty of contents, regaining maximum performance. We also demonstrate how to adapt OLAP queries in the new schema, to take advantage of the most recent data which is integrated in real time. The experiments which are presented were performed using a standard benchmark for decision support systems, benchmark TPC-H, from the Transaction Processing Council [22]. Several configurations were tested, varying items such as the database’s physical size, the number of simultaneous users executing OLAP queries, the amount of available RAM memory and the time rates between transactions. The remainder of this paper is organized as follows. In section 2, we refer the requirements for real-time data warehousing. Section 3 presents background and related work in real-time data warehousing. Section 4 explains our methodology, and in section 5 we present the experimental evaluation of our methods and demonstrate its functionality. The final section contains concluding remarks and future work 2. REQUIREMENTS FOR REAL-TIME DATA WAREHSOUSING
  • 8. Nowadays, organizations generate large amounts of data, at a rate which can easily reach several megabytes or gigabytes per day. Therefore, as time goes by, a business data warehouse can easily grow to terabytes or even petabytes. The size of data warehouses imply that each query which is executed against its data area usually accesses large amounts of records, also performing actions such as joins, sorting, grouping and calculation functions. To optimize these accesses, the data warehouse uses predefined internal data structures (such as indexes or partitions, for instance), which are also large in size and have a very considerable level of complexity. These facts imply that it is very difficult to efficiently update the data warehouse’s data area in real-time, for the propagation of transactional data in real-time would most likely overload the server, given its update frequency and volume; it would involve immense complex operations on the data warehouse’s data structures and dramatically degrade OLAP performance. In a nutshell, real-time data warehouses aim for decreasing the time it takes to make decisions and try to attain zero latency between cause and effect for that decision, closing the gap between intelligent reactive systems and systems processes. Our aim is transforming a standard DW using batch loading during update windows (during which analytical access is not allowed) into near zero latency analytical environment providing current data, in order to enable (near) real-time dissemination of new information across an organization. The business requirements for
  • 9. this kind of analytical environment introduce a set of service level agreements that go beyond what is typical in a traditional DW. The major issue is how to enable continuous data integration and assuring that it minimizes negative impact in several main features of the system, such as availability and response time of both OLTP and OLAP systems. An in-depth discussion of these features from the analytical point of view (to enable timely consistent analysis) is given in [6]. Combining highly available systems with active decision engines allows near real-time information dissemination for data warehouses. Cumulatively, this is the basis for zero latency analytical environments [6]. The real-time data warehouse provides access to an accurate, integrated, consolidated view of the organization’s information and helps to deliver near real- time information to its users. This requires efficient ETL techniques enabling continuous data integration, the focus of this paper. By adopting real-time data warehousing, it becomes necessary to cope with at least two radical data state changes. First, it is necessary to perform continuous data update actions, due to the continuous data integration, which should mostly concern row insertions. Second, these update actions must be performed in parallel with the execution of OLAP, which – due to its new real- time nature – will probably be solicited more often. Therefore, the main contributions of this paper are threefold:
  • 10. • Maximizing the freshness of data by efficiently and rapidly integrating most recent OLTP data into the data warehouse; • Minimizing OLAP response time while simultaneously performing continuous data integration; • Maximizing the data warehouse’s availability by reducing its update time window, in which users and OLAP applications are off-line. 3. RELATED WORK So far, research has mostly dealt with the problem of maintaining the warehouse in its traditional periodical update setup [14, 27]. Related literature presents tools and algorithms to populate the warehouse in an off-line fashion. In a different line of research, data streams [1, 2, 15, 20] could possibly appear as a potential solution. However, research in data streams has focused on topics concerning the front-end, such as on-the-fly computation of queries without a systematic treatment of the issues raised at the back-end of a data warehouse [10]. Much of the recent work dedicated to RTDW is also focused on conceptual ETL modelling [4, 5, 19, 23], lacking the presentation of concrete specific extraction, transformation and loading algorithms along with their consequent OLTP and OLAP performance issues. Temporal data warehouses address the issue of supporting temporal information efficiently in data warehousing systems [25]. In [27], the authors present efficient techniques (e.g. temporal view self-maintenance) for maintaining data warehouses
  • 11. without disturbing source operations. A related challenge is supporting large-scale temporal aggregation operations in data warehouses [26]. In [4], the authors describe an approach which clearly separates the DW refreshment process from its traditional handling as a view maintenance or bulk loading process. They provide a conceptual model of the process, which is treated as a composite workflow, but they do not describe how to efficiently propagate the date. Theodoratus et al. discuss in [21] data currency quality factors in data warehouses and propose a DW design that considers these factors. An important issue for near real-time data integration is the accommodation of delays, which has been investigated for (business) transactions in temporal active databases [18]. The conclusion is that temporal faithfulness for transactions has to be provided, which preserves the serialization order of a set of business transactions. Although possibly lagging behind real- time, a system that behaves in a temporally faithful manner guarantees the expected serialization order. In [23], the authors describe the ARKTOS ETL tool, capable of modeling and executing practical ETL scenarios by providing explicit primitives for capturing common tasks (such as data cleaning, scheduling and data transformations) using a declarative language. ARKTOS offers graphical and declarative features for defining DW transformations and tries to optimize the execution of complex sequences of transformation and cleansing tasks. In [13] is described a zero-delay DW with Gong, which assists in
  • 12. providing confidence in the data available to every branch of the organization. Gong is a Tecco product [3] that offers uni or bi- directional replication of data between homogeneous and/or heterogeneous distributed databases. Gong’s database replication enables zero-delay business in order to assist in the daily running and decision making of the organization. But not all transactional informations need to be immediately dealt with in real-time decision making requirements. We can define which groups of data is more important to include rapidly in the data warehouse and other groups of data which can be updated in latter time. Recently, in [9], the authors present an interesting architecture on how to define the types of update and time priorities (immediate, at specific time intervals or only on data warehouse offline updates) and respective synchronization for each group of transactional data items. Our methodology overcomes some of the mentioned drawbacks and is presented in the next section. 4. CONTINUOUS DATA WAREHOUSE LOADING METHODOLOGY From the DW side, updating huge tables and related structures (such as indexes, materialized views and other integrated components) makes executing OLAP query workloads simultaneously with continuous data integration a very difficult task. Our methodology minimizes the processing time and workload required for these update processes. It also facilitates the DW off-line update (see section 4.4), because the data already lies within the data area and all OLTP data extraction and/or transformation routines have been executed during the
  • 13. continuous data integration. Furthermore, the data structure of the replicated tables is exactly the same as the original DW schema. This minimizes the time window for packing the data area of the DW, since its update represents a one step process by resuming itself as a cut-and-paste action from the temporary tables to the original ones, as we shall demonstrate further. Our methodology is focused on four major areas: (1) data warehouse schema adaptation; (2) ETL loading procedures; (3) OLAP query adaptation; and (4) DW database packing and reoptimization. It is mainly based on a very simple principle: new row insertion procedures in tables with few (or no) contents are performed much faster than in big size tables. It is obvious and undeniable that data handling in small sized tables are much less complex and much faster than in large sized tables. As a matter of fact, this is mainly the reason why OLTP data sources are maintained with the fewer amount possible of necessary records, in order to maximize its availability. The proposed continuous data warehouse loading methodology is presented in Figure 1. 4.1 Adapting the Data Warehouse Schema Suppose a very simple sales data warehouse with the schema illustrated in Figure 2, having two dimensional tables (Store and Customer, representing business descriptor entities) and one fact table (Sales, storing business measures aggregated from
  • 14. transactions). To simplify the figure, the Date dimension is not shown. This DW allows storing the sales value per store, per customer, per day. The primary keys are represented in bold, while the referential integrity constraints with foreign keys are represented in italic. The factual attribute S_Value is additive. This property in facts is very important for our methodology, as we shall demonstrate further on. For the area concerning data warehouse schema adaptation, we adopt the following method: Data warehouse schema adaptation method for supporting real-time data warehousing: Creation of an exact structural replica of all the tables of the data warehouse that could eventually receive new data. These tables (referred also as temporary tables) are to be created empty of contents, with no defined indexes, primary key, or constraints of any kind, including referential integrity. For each table, an extra attribute must be created, for storing a unique sequential identifier related to the insertion of each row within the temporary tables. The modified schema for supporting RTDW based on our methodology is illustrated in Figure 3. The unique sequential identifier attribute present in each temporary table should record the sequence in which each row is appended in the Data Area. This will allow identifying the exact sequence of arrival for each new inserted row. This is useful for restoring prior data states in disaster recovery procedures, and also for discarding dimensional rows which have more recent updates. For instance, if the same customer has had two updates in the OLTP systems which, consequently, lead to the insertion of two new rows in the
  • 15. temporary table CustomerTmp, only the most recent one is relevant. This can be defined by considering as most recent the row with highest CTmp_Counter for that same customer (CTmp_CustKey). CustomerTmp CTmp_CustKey CTmp_Name CTmp_Address CTmp_PostalCode CTmp_Phone CTmp_EMail CTmp_Counter SalesTmp STmp_StoreKey STmp_CustomerKey STmp_Date STmp_Value STmp_Counter StoreTmp StTmp_StoreKey StTmp_Description StTmp_Address StTmp_PostalCode StTmp_Phone StTmp_EMail StTmp_Manager StTmp_Counter
  • 16. The authors of the ARKTOS tool [23] refer that their own experience, as well as the most recent literature, suggests that the main problems of ETL tools do not consist only in performance problems (as normally would be expected), but also in aspects such as complexity, practibility and price. By performing only record insertion procedures inherent to continuous data integration using empty or small sized tables without any kind of constraint or attached physical file related to it, we guarantee the simplest and fastest logical and physical support for achieving our goals [12]. The fact that the only significant change in the logical and physical structure of the data warehouse’s schema is the simple adaptation shown in Figure 3, allows that the implementation of the necessary ETL procedures can be made in a manner to maximize its operationability. Data loading may be done by simple standard SQL instructions or DBMS batch loading software such as SQL*Loader [16], with a minimum of complexity. There is no need for developing complex routines for updating the data area, in which the needed data for is easily accessible, independently from the used ETL tools. 4.2 ETL Loading Procedures To refresh the data warehouse, once the ETL application has extracted and transformed the OLTP data into the correct format for loading the data area of the DW, it shall proceed immediately in inserting that record as a new row in the correspondent temporary table, filling the unique sequential identifier attribute with the autoincremented sequential number. This number should
  • 17. start at 1 for the first record to insert in the DW after executing Figure 1. General architecture of the proposed continuous data warehouse loading methodology. Figure 2. Sample sales data warehouse schema. Figure 3. Sample sales data warehouse schema modified for supporting real-time data warehousing. the packing and reoptimizing technique (explained in section 4.4 of this paper), and then be autoincremented by one unit for each record insertion. The algorithm for accomplishing continuous data integration by the ETL tool may be similar to: Trigger for each new record in OLTP system (after it is commited) Extract new record from OLTP system Clean and transform the OLTP data, shaping it into the data warehouse destination table’s format Increment record insertion unique counter Create a new record in the data warehouse temporary destination table Insert the data in the temporary destination table’s new record, along with the value of the record insertion unique counter End_Trigger
  • 18. Following, we demonstrate a practical example for explaining situations regarding updating the data warehouse shown in Figure 3. Figure 4 presents the insertion of a row in the data warehouse temporary fact table for the recording of a sales transaction of value 100 which took place at 2008-05-02 in store with St_StoreKey = 1 related to customer with C_CustKey = 10, identified by STmp_Counter = 1001. Meanwhile, other transactions occurred, and the organization’s OLTP system recorded that instead of a value of 100 for the mentioned transaction, it should be 1000. The rows in the temporary fact table with STmp_Counter = 1011 and STmp_Counter = 1012 reflect this modification of values. The first eliminates the value of the initial transactional row and the second has the new real value, due to the additivity of the STmp_Value attribute. The definition of which attributes are additive and which are not should be the responsibility of the Database Administrator. According to [11], the most useful facts in a data warehouse are numeric and additive. The method for data loading uses the most simple method for writing data: appending new records. Any other type of writing method needs the execution of more time consuming and complex tasks. STmp_StoreKey STmp_CustomerKey STmp_Date STmp_Value STmp_Counter 1 10 2008-05-02 100 1001
  • 19. 1 10 2008-05-02 -100 1011 1 10 2008-05-02 1000 1012 4.3 OLAP Query Adaptation Consider the following OLAP query, for calculating the total revenue per store in the last seven days. SELECT S_StoreKey, Sum(S_Value) AS Last7DaysSaleVal FROM Sales WHERE S_Date>=SystemDate()-7 GROUP BY S_StoreKey To take advantage of our schema modification method and include the most recent data in the OLAP query response, queries should be rewritten taking under consideration the following rule: the FROM clause should join all rows from the required original and temporary tables with relevant data, excluding all fixed restriction predicate values from the WHERE clause whenever possible. The modification for the prior instruction is illustrated below, with respect to our methodology. It can be seen that the relevant rows from both issue tables are joined for supplying
  • 20. the OLAP query answer, filtering the rows used in the resulting dataset according to its restrictions in the original instruction. SELECT S_StoreKey, Sum(S_Value) AS Last7DaysSaleVal FROM (SELECT S_StoreKey, S_Value FROM Sales WHERE S_Date>=SystemDate()-7) UNION ALL (SELECT STmp_StoreKey, STmp_Value FROM SalesTmp WHERE STmp_Date>=SystemDate()-7) GROUP BY S_StoreKey An interesting and relevant aspect of the proposed methodology is that if OLAP users wish to query only the most recent information, they only need to do so against the temporary replicated tables. For instance, if the temporary tables are meant to be filled with data for each business day before they are Figure 4. Partial contents of the temporary fact table SalesTmp with exemplification of record insertions.
  • 21. recreated, and we want to know the sales value of the current day, per store, the adequate response could be obtained from the following SQL instruction: SELECT STmp_StoreKey, Sum(STmp_Value) AS TodaysValue FROM SalesTmp WHERE STmp_Date=SystemDate() GROUP BY STmp_StoreKey This way, our method aids the processing of the data warehouse’s most recent data, for this kind of data is stored within the temporary replica tables, which are presumed to be small in size. This minimizes CPU, memory and I/O costs involved in most recent data query processing. Theoretically, this would make it possible to deliver the most recent decision making information while the business transaction itself occurs. 4.4 Packing and Reoptimizing the Data Warehouse Since the data is being integrated within tables that do not have access optimization of any kind that could speed up querying, such as indexes, for instance, it is obvious that its functionality is affected, implying a decrease of performance. Due to the size of physically occupied space, after a certain number of insertions the performance becomes too poor to consider as acceptable. To regain performance in the DW it is necessary to execute a pack
  • 22. routine which updates the original DW schema tables with the records in the temporary tables and then recreates these temporary tables empty of contents, along with rebuilding the original tables' indexes and materialized views, so that maximum processing speed is obtained once more. For updating the original DW tables, the rows in the temporary tables should be aggregated according to the original tables' primary keys, keeping, for possible duplicate values in non-additive attributes, the rows with the highest unique counter attribute value, since these represent the most recent records. The time needed to execute these update procedures is the only period in which the DW is unavailable to OLAP tools and end users, because they need to be executed exclusively. The appropriate moment for doing this may be determined by the Database Administrator, or automatically, taking into consideration parameters such as a given number of records in the temporary tables, the amount of physically occupied space, or a predefined period of time. The determination of this moment should be the best possible compromise between its frequency of execution and the amount of time it takes away from user availability, which depends on the physical, logical and practical characteristics of each specific DW implementation and is not the object of discussion in this paper.
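To make the pack step more concrete, the following sketch outlines one possible implementation for the Sales/SalesTmp example used earlier. It is only an illustration that assumes a purely additive fact table and a hypothetical index name; it is not the paper's actual procedure.

-- Fold the appended rows into the original fact table, aggregating the
-- additive measure by the original table's key attributes.
INSERT INTO Sales (S_StoreKey, S_CustomerKey, S_Date, S_Value)
SELECT STmp_StoreKey, STmp_CustomerKey, STmp_Date, SUM(STmp_Value)
FROM SalesTmp
GROUP BY STmp_StoreKey, STmp_CustomerKey, STmp_Date;

-- Recreate the temporary table empty of contents and rebuild the original
-- table's optimizations so that maximum processing speed is obtained again.
TRUNCATE TABLE SalesTmp;
ALTER INDEX Sales_Date_IDX REBUILD;   -- hypothetical index name

If the original table can already contain rows for the same key, a MERGE would be used instead, adding the aggregated additive values and, for non-additive attributes, keeping the values of the row with the highest counter.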
  • 23. 4.5 Final Remarks on Our Methodology
Notice that only record insertions are used to update the DW for all related transactions in the OLTP source systems. Since this type of procedure does not require any record locking in the tables (except for the appended record itself), nor searching for previously stored data before writing (as in update or delete instructions), the time necessary to accomplish it is minimal. The absence of record locking is reinforced by the fact that the referred tables have no indexes or primary keys, implying absolutely no locking beyond the appended record itself. Furthermore, since they have no constraints of any sort, including referential integrity and primary keys, there is no need to execute time-consuming tasks such as index updating or referential integrity cross checks. Kimball notes in [12] that many ETL tools use an UPDATE ELSE INSERT function when loading data, considering it a performance killer. With our method, any insert, update or delete on the OLTP systems is reflected only as new record insertions in the data warehouse, which minimizes row, block and table locks and other concurrent data access problems. Physical database tablespace fragmentation is also avoided, since there is no deletion of data, only sequential increments. This allows us to state that the data update time window for our method is minimal for the insertion of new data, maximizing the availability of that data and consequently contributing to effectively increase the data warehouse's global availability and minimize any negative impact on its performance. As mentioned earlier, the extracted data usually needs to be corrected and transformed before updating the data warehouse's data area. Since we intend to obtain this data in near real-time,
  • 24. the time gap between recording OLTP transactions and their extraction by ETL processes is minimal, as both occur nearly at the same time, which somewhat reduces the probability of errors. We can also assume that the number of intermediate "information buckets" which the data passes through in the ETL area is minimal, since temporary storage is almost not needed. Furthermore, instead of extracting a considerable amount of OLTP data, as happens in "traditional" bulk-loading data warehousing, the volume of information extracted and transformed in real-time is extremely small (commonly a few dozen bytes), since it consists of only one transaction per execution cycle. All this allows us to assume that the extraction and transformation phase will be cleaner and more time efficient. As a limitation of our methodology, data warehouse contexts in which additive attributes are difficult or even impossible to define for the fact tables may invalidate its use.
5. EXPERIMENTAL EVALUATION
Using the TPC-H decision support benchmark [22], we tested our methodology by creating 5GB, 10GB and 20GB DWs for continuous data integration at several time rates, on the Oracle 10g DBMS [16]. We used an Intel Celeron 2.8GHz machine with 512MB, 1GB or 2GB of SDRAM and a 7200rpm 160GB hard disk. The TPC-H schema modified according to our methodology is shown in Figure 5. Tables Region and Nation are not replicated as temporary tables because they are fixed-size tables in the TPC-H benchmark and therefore do not receive new data. For the query workloads we selected TPC-H queries 1, 8, 12 and 20 [22], executed in random order for each simultaneous user in the tests, for up to 8 hours of testing for each scenario.
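As a purely illustrative sketch of how such continuous integration can be driven (this is not the tooling used in the paper; the 3,072 transactions and the 9.38-second interval correspond to scenario DW5GB-1D characterized below, and the column lists are abbreviated, hypothetical placeholders), a PL/SQL loop could append one transaction and then wait for the scenario's average inter-transaction time:

-- Minimal driver for one scenario: append one transaction, commit, sleep.
BEGIN
  FOR i IN 1 .. 3072 LOOP
    INSERT INTO OrdersTmp (O_OrderKey, O_CustKey, O_OrderDate, O_TotalPrice)
    VALUES (900000 + i, 10, SYSDATE, 1000);                 -- hypothetical values
    INSERT INTO LineItemTmp (L_OrderKey, L_LineNumber, L_Quantity, L_ExtendedPrice)
    VALUES (900000 + i, 1, 1, 250);                         -- repeated ~4 times per transaction
    COMMIT;
    DBMS_LOCK.SLEEP(9.38);                                  -- average time between transactions
  END LOOP;
END;
/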
  • 25. The summarized results presented in this section correspond to approximately 4000 hours of testing. The number of transactions, the average time interval between transactions, and the number of records and data size to be inserted for each scenario are shown in Table 1. Each new transaction represents the insertion of an average of four records in LineItemTmp and one row in each of the other temporary tables, continuously integrated over a period of 8 hours. For each database size, we also tested three different transaction insertion frequencies, referred to as 1D, 2D and 4D, each corresponding to a different number of records inserted throughout the tests. The impact on OLAP processing while performing the continuous data integration is shown in Figures 6 and 7. Table 2 summarizes the results for each scenario. All results show that the system is significantly dependent on the transaction rates, performing better when more RAM is used. They show a minimal cost of around 8% in query response time for applying our methodology, independently of the amount of RAM, at a transactional frequency of 1 transaction every 9.38 seconds. Apparently, this is the lowest price to pay for having real
  • 26. time data warehousing with our methodology, for the tested benchmark. At the highest transactional frequency, 1 transaction every 0.57 seconds, the increase in OLAP response time climbs to a maximum of 38.5% with 512MB of RAM. The results also show that the methodology is relatively scalable with respect to database size.

Table 1. Characterization of the experimental evaluation scenarios.
Scenario  | Transactions to load in 8 hours | Average time between transactions | Records to integrate in 8 hours | Total size of RTDW transactions
DW5GB-1D  | 3,072  | 9.38 seconds | 54,040  | 4.59 MBytes
DW5GB-2D  | 6,205  | 4.64 seconds | 106,302 | 9.28 MBytes
DW5GB-4D  | 12,453 | 2.31 seconds | 204,904 | 18.62 MBytes
DW10GB-1D | 6,192  | 4.65 seconds | 109,464 | 9.26 MBytes
DW10GB-2D | 12,592 | 2.29 seconds | 213,912 | 18.83 MBytes
DW10GB-4D | 25,067 | 1.15 seconds | 410,764 | 37.48 MBytes
DW20GB-1D | 12,416 | 2.32 seconds | 218,634 | 18.57 MBytes
DW20GB-2D | 25,062 | 1.15 seconds | 429,214 | 37.48 MBytes
DW20GB-4D | 50,237 | 0.57 seconds | 825,242 | 75.12 MBytes

Figure 5. TPC-H data warehouse schema modified for supporting real-time data warehousing.
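As an illustration of the structure replication behind the schema of Figure 5 (a sketch only; the paper does not prescribe a particular DDL and the counter column name is hypothetical), each temporary replica can be created as an empty, structure-only copy of its original table, deliberately without indexes, primary keys or other constraints:

-- Structure-only copy: no rows, and CTAS does not carry over indexes,
-- primary keys, foreign keys or materialized views.
CREATE TABLE LineItemTmp AS
SELECT l.*, CAST(NULL AS NUMBER) AS LTmp_Counter   -- unique counter assigned by the loader
FROM LineItem l
WHERE 1 = 0;

Because the replica has no indexes or constraints, appending to it locks only the inserted row and triggers no index maintenance or integrity checks.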
  • 27. Figure 6. Average increase of the OLAP response time (data warehouse dimension – RAM memory).
Figure 7. Average increase of the OLAP response time per transactional frequency.

Table 2. Characterization of the average increase of OLAP response times for each experimental scenario.
Avg. time between transactions | 5GB DW (512MB / 1GB / 2GB RAM) | 10GB DW (512MB / 1GB / 2GB RAM) | 20GB DW (512MB / 1GB / 2GB RAM)
9.38 seconds | 8.8% / 8.3% / 8.5%    | nt / nt / nt          | nt / nt / nt
4.65 seconds | 15.7% / 14.3% / 13.3% | 13.4% / 11.2% / 9.7%  | nt / nt / nt
2.32 seconds | 23.0% / 20.4% / 17.4% | 22.6% / 18.0% / 15.4% | 17.7% / 10.1% / 9.1%
1.15 seconds | nt / nt / nt          | 31.9% / 25.3% / 20.1% | 27.8% / 20.2% / 17.2%
0.57 seconds | nt / nt / nt          | nt / nt / nt          | 38.5% / 30.4% / 25.9%
nt – not tested

  • 33. 6. CONCLUSIONS AND FUTURE WORK
This paper describes the necessary requirements for RTDW and presents a methodology for supporting the implementation of RTDW by enabling continuous data integration while minimizing the impact on query execution at the user end of the DW. This is achieved by replicating data structures and adapting query instructions to take advantage of the new real-time data warehousing schemas. We have shown its functionality through a simulation using the TPC-H benchmark, performing continuous data integration at various time rates against the execution of several simultaneous query workloads, for data warehouses of different sizes. All scenarios show that it is possible to achieve real-time data warehousing in exchange for an average increase in query execution time. This should be considered the price to pay for real-time capability within the data warehouse.
  • 34. As future work we intend to develop an ETL tool which will integrate this methodology with extraction and transformation routines for the OLTP systems. There is also room for optimizing the query instructions used for our methods. 7. REFERENCES [1] D. J. Abadi, D. Carney, et al., 2003. “Aurora: A New Model and Architecture for Data Stream Management”, The VLDB Journal, 12(2), pp. 120-139. [2] S. Babu, and J. Widom, 2001. “Continuous Queries Over Data Streams”, SIGMOD Record 30(3), pp. 109-120. [3] T. Binder, 2003. Gong User Manual, Tecco Software Entwicklung AG. [4] M. Bouzeghoub, F. Fabret, and M. Matulovic, 1999. “Modeling Data Warehouse Refreshment Process as a Workflow Application”, Intern. Workshop on Design and Management of Data Warehouses (DMDW). [5] R. M. Bruckner, B. List, and J. Schiefer, 2002. “Striving Towards Near Real-Time Data Integration for Data Warehouses”, International Conference on Data Warehousing and Knowledge Discovery (DAWAK). [6] R. M. Bruckner, and A. M. Tjoa, 2002. “Capturing Delays and Valid Times in Data Warehouses – Towards Timely Consistent Analyses”. Journal of Intelligent Information Systems (JIIS), 19:2, pp. 169-190. [7] S. Chaudhuri, and U. Dayal, 1997. “An Overview of Data Warehousing and OLAP Technology”, SIGMOD Record,
  • 35. Volume 26, Number 1, pp. 65-74. [8] W. H. Inmon, R. H. Terdeman, J. Norris-Montanari, and D. Meers, 2001. Data Warehousing for E-Business, J. Wiley & Sons. [9] I. C. Italiano, and J. E. Ferreira, 2006. “Synchronization Options for Data Warehouse Designs”, IEEE Computer Magazine. [10] A. Karakasidis, P. Vassiliadis, and E. Pitoura, 2005. “ETL Queues for Active Data Warehousing”, IQIS’05. [11] R. Kimball, L. Reeves, M. Ross, and W. Thornthwaite, 1998. The Data Warehouse Lifecycle Toolkit – Expert Methods for Designing, Developing and Deploying Data Warehouses, Wiley Computer Publishing. [12] R. Kimball, and J. Caserta, 2004. The Data Warehouse ETL Toolkit, Wiley Computer Publishing. [13] E. Kuhn, 2003. “The Zero-Delay Data Warehouse: Mobilizing Heterogeneous Databases”, International Conference on Very Large Data Bases (VLDB). [14] W. Labio, J. Yang, Y. Cui, H. Garcia-Molina, and J. Widom, 2000. “Performance Issues in Incremental Warehouse Maintenance”, International Conference on Very Large Data Bases (VLDB). [15] D. Lomet, and J. Gehrke, 2003. Special Issue on Data Stream Processing, IEEE Data Eng. Bulletin, 26(1). [16] Oracle Corporation, 2005. www.oracle.com
  • 36. [17] T. B. Pedersen, 2004. "How is BI Used in Industry?", Int. Conf. on Data Warehousing and Knowledge Discovery (DAWAK). [18] J. F. Roddick, and M. Schrefl, 2000. "Towards an Accommodation of Delay in Temporal Active Databases", 11th ADC. [19] A. Simitsis, P. Vassiliadis and T. Sellis, 2005. "Optimizing ETL Processes in Data Warehouses", International Conference on Data Engineering (ICDE). [20] U. Srivastava, and J. Widom, 2004. "Flexible Time Management in Data Stream Systems", PODS. [21] D. Theodoratus, and M. Bouzeghoub, 1999. "Data Currency Quality Factors in Data Warehouse Design", International Workshop on the Design and Management of Data Warehouses (DMDW). [22] TPC-H decision support benchmark, Transaction Processing Performance Council, www.tpc.com. [23] P. Vassiliadis, Z. Vagena, S. Skiadopoulos, N. Karayannidis, and T. Sellis, 2001. "ARKTOS: Towards the Modelling, Design, Control and Execution of ETL Processes", Information Systems, Vol. 26(8).
  • 37. [24] C. White, 2002. "Intelligent Business Strategies: Real-Time Data Warehousing Heats Up", DM Review, www.dmreview.com/article_sub_cfm?articleId=5570. [25] J. Yang, 2001. "Temporal Data Warehousing", Ph.D. Thesis, Dept. of Computer Science, Stanford University. [26] J. Yang, and J. Widom, 2001. "Incremental Computation and Maintenance of Temporal Aggregates", 17th International Conference on Data Engineering (ICDE). [27] J. Yang, and J. Widom, 2001. "Temporal View Self-Maintenance", 7th Int. Conf. on Extending Database Technology (EDBT). [28] T. Zurek, and K. Kreplin, 2001. "SAP Business Information Warehouse – From Data Warehousing to an E-Business Platform", 17th International Conference on Data Engineering (ICDE).

An Oracle White Paper
August 2012
Best Practices for Real-time Data Warehousing
  • 38. Executive Overview
Today's integration project teams face the daunting challenge that, while data volumes are growing exponentially, the need for timely and accurate business intelligence is also constantly increasing. Batches for data warehouse loads used to be scheduled daily to weekly; today's businesses demand information that is as fresh as possible. Because the value of this real-time business data decreases as it gets older, low latency of data integration is essential for the business value of the data warehouse. At the same time the concept of "business hours" is vanishing for a global enterprise, as data warehouses are in use 24 hours a day, 365 days a
  • 39. year. This means that the traditional nightly batch windows are becoming harder to accommodate, and interrupting or slowing down sources is not acceptable at any time during the day. Finally, integration projects have to be completed in shorter release timeframes, while fully meeting functional, performance, and quality specifications on time and within budget. These processes must be maintainable over time, and the completed work should be reusable for further, more cohesive, integration initiatives. Conventional "Extract, Transform, Load" (ETL) tools closely intermix data transformation rules with integration process procedures, requiring the development of both data transformations and data flow. Oracle Data Integrator (ODI) takes a different approach to integration by clearly separating the declarative rules (the "what") from the actual implementation (the "how"). With ODI, declarative rules describing mappings and transformations are defined graphically, through a drag-and-drop interface, and stored independently from the implementation. ODI automatically
  • 40. generates the data flow, which can be fine-tuned if required. This innovative approach to declarative design has also been applied to ODI's framework for Changed Data Capture (CDC). ODI's CDC moves only changed data to the target systems and can be integrated with Oracle GoldenGate, thereby enabling the kind of real-time integration that businesses require. This technical brief describes several techniques available in ODI to adjust data latency from scheduled batches to continuous real-time integration.
Introduction
The conventional approach to data integration involves extracting all data from the source system and then integrating the entire set—possibly using an incremental strategy—in the target system. This approach, which is suitable in most cases,
  • 41. can be inefficient when the integration process requires real-time data integration. In such situations, the amount of data involved makes data integration impossible in the given timeframes. Basic solutions, such as filtering records according to a timestamp column or "changed" flag, are possible, but they might require modifications in the applications. In addition, they usually do not sufficiently ensure that all changes are taken into account. ODI's Changed Data Capture identifies and captures data as it is being inserted, updated, or deleted from datastores, and it makes the changed data available for integration processes.
Real-Time Data Integration Use Cases
Integration teams require real-time data integration with low or no data latency for a number of use cases. While this whitepaper focuses on data warehousing, it is useful to differentiate the following areas:
- Real-time data warehousing: Aggregation of analytical data in a data warehouse using continuous or near real-time loads.
- Operational reporting and dashboards: Selection of operational data into a reporting database for BI tools and dashboards.
- Query Offloading: Replication of high-cost or legacy OLTP servers to secondary systems to ease query load.
- High Availability / Disaster Recovery: Duplication of database systems in active-active or active-passive scenarios to improve availability during outages.
- Zero Downtime Migrations: Ability to synchronize data between old and new systems with potentially different technologies to allow for switch-over and switch-back without downtime.
- Data Federation / Data Services: Provide virtual, canonical views of data distributed over several systems through
  • 43. federated queries over heterogeneous sources.
Oracle has various solutions for different real-time data integration use cases. Query offloading, high availability/disaster recovery, and zero-downtime migrations can be handled through the Oracle GoldenGate product, which provides heterogeneous, non-intrusive and highly performant changed data capture, routing, and delivery. To provide no- to low-latency loads, ODI offers various alternatives for real-time data warehousing through the use of CDC mechanisms, including the integration with Oracle GoldenGate. This integration also provides seamless operational reporting. Data federation and data service use cases are covered by Oracle Data Service Integrator (ODSI).
Architectures for Loading Data Warehouses
Various architectures for collecting transactional data from
  • 44. operational sources have been used to populate data warehouses. These techniques vary mostly in the latency of data integration, from daily batches to continuous real-time integration. The capture of data from sources is performed either through incremental queries that filter based on a timestamp or flag, or through a CDC mechanism that detects any change as it happens. Architectures are further distinguished between pull and push operation, where a pull operation polls in fixed intervals for new data, while in a push operation data is loaded into the target once a change appears. A daily batch mechanism is most suitable if intra-day freshness is not required for the data, such as longer-term trends or data that is only calculated once daily, for example financial close information. Batch loads might be performed in a downtime window, if the business model doesn't require 24-hour availability of the data warehouse. Different techniques such as real-time partitioning or trickle-and-flip1 exist to minimize the impact of a load to a live data warehouse without downtime.

               | Batch | Mini-Batch | Micro-Batch | Real-Time
Description    | Data is loaded in full or incrementally using an off-peak window. | Data is loaded incrementally using intra-day loads. | Source changes are captured and accumulated to be loaded in intervals. | Source changes are captured and immediately applied to the DW.
Latency        | Daily or higher | Hourly or higher | 15 min and higher | Sub-second
Capture        | Filter query | Filter query | CDC | CDC
Initialization | Pull | Pull | Push, then Pull | Push
Target load    | High impact | High impact | Low impact, load frequency is tuneable | Low impact, load frequency is tuneable
Source load    | High impact | Queries at peak times necessary | Some to none, depending on CDC technique | Some to none, depending on CDC technique

1 See also: Real-Time Data Warehousing: Challenges and
  • 46. Solutions by Justin Langseth (http://dssresources.com/papers/features/langseth/langseth02082004.html)
IMPLEMENTING CDC WITH ODI
Change Data Capture as a concept is natively embedded in ODI. It is controlled by the modular Knowledge Module concept and supports different methods of CDC. This chapter describes the details and benefits of the ODI CDC feature.
Modular Framework for Different Load Mechanisms
  • 47. ODI supports each of the described data warehouse load architectures with its modular Knowledge Module architecture. Knowledge Modules enable integration designers to separate the declarative rules of data mapping from their technical implementation by selecting a best-practice mechanism for data integration. Batch and Mini-Batch strategies can be defined by selecting Load Knowledge Modules (LKM) for the appropriate incremental load from the sources. Micro-Batch and Real-Time strategies use the Journalizing Knowledge Modules (JKM) to select a CDC mechanism that immediately accesses changes in the data sources. Mapping logic can be left unchanged when switching KM strategies, so that a
  • 48. change in loading patterns and latency does not require a rewrite of the integration logic.
Methods for Tracking Changes using CDC
ODI has abstracted the concept of CDC into a journalizing framework with a JKM and journalizing infrastructure at its core. By isolating the physical specifics of the capture process from the processing of the detected changes, it is possible to support a number of different techniques that are represented by individual JKMs:
Non-invasive CDC through Oracle GoldenGate
Oracle GoldenGate provides a CDC mechanism that can process source changes non-invasively by processing log files of completed transactions and storing these captured
  • 49. changes into external Trail Files independent of the database. Changes are then reliably transferred to a staging database. (Figure 1: GoldenGate-based CDC.) The JKM uses the metadata managed by ODI to generate
  • 50. all Oracle GoldenGate configuration files, and processes all GoldenGate-detected changes in the staging area. These changes will be loaded into the target data warehouse using ODI's declarative transformation mappings. This architecture enables separate real-time reporting on the normalized staging area tables in addition to loading and transforming the data into the analytical data warehouse tables.
Database Log Facilities
  • 51. Some databases provide APIs and utilities to process table changes programmatically. For example, Oracle provides the Streams interface to process log entries and store them in separate tables. Such log-based JKMs have better scalability than trigger-based mechanisms, but still require changes to the source database. ODI also supports log-based CDC on DB2 for iSeries using its journals.
Database Triggers
JKMs based on database triggers define procedures that are executed inside the source database when a table change occurs. Based on the wide availability of trigger mechanisms in databases, JKMs based on triggers are available for a wide range of sources such as Oracle,
  • 52. IBM DB2 for iSeries and UDB, Informix, Microsoft SQL Server, Sybase, and others. The disadvantage is the limited scalability and performance of trigger procedures, making them optimal for use cases with light to medium loads. (Figure 2: Streams-based CDC on Oracle. Figure 3: Trigger-based CDC.)
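As a generic illustration of the trigger-based capture pattern described above (a simplified sketch, not the code an ODI JKM actually generates; the source table ORDERS and the journal layout are hypothetical), a capture trigger could append the key of every changed row to a journal table:

-- Hypothetical journal table holding the key of each changed source row.
CREATE TABLE J$ORDERS (
  JRN_SUBSCRIBER VARCHAR2(30),
  JRN_FLAG       CHAR(1),        -- 'I' for insert/update, 'D' for delete
  JRN_DATE       DATE,
  ORDER_ID       NUMBER
);

-- Simplified capture trigger: publishes the primary key of every change.
CREATE OR REPLACE TRIGGER TRG_ORDERS_CDC
AFTER INSERT OR UPDATE OR DELETE ON ORDERS
FOR EACH ROW
BEGIN
  IF DELETING THEN
    INSERT INTO J$ORDERS VALUES ('DW_LOAD', 'D', SYSDATE, :OLD.ORDER_ID);
  ELSE
    INSERT INTO J$ORDERS VALUES ('DW_LOAD', 'I', SYSDATE, :NEW.ORDER_ID);
  END IF;
END;
/

An integration process (the subscriber) then periodically joins the journal against the source table to load only the changed rows, as described in the publish-and-subscribe model below.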
  • 53. Oracle Changed Data Capture Adapters
Oracle also provides separate CDC adapters that cover legacy platforms such as DB2 for z/OS, VSAM CICS, VSAM Batch, IMS/DB and Adabas. These adapters provide performance by capturing changes directly from the database logs.

Source databases supported for ODI CDC
Database | Log-based CDC: Oracle GoldenGate | Log-based CDC: Database Log Facilities | Log-based CDC: Oracle CDC Adapters | Trigger-based CDC JKM
Oracle | ✓ | ✓ | | ✓
MS SQL Server | ✓ | | ✓ | ✓
Sybase ASE | ✓ | | | ✓
DB2 UDB | ✓ | | | ✓
DB2 for iSeries | ✓ (2) | ✓ | | ✓
DB2 for z/OS | ✓ (2) | | ✓ |
Informix, Hypersonic DB | | | | ✓
Teradata, Enscribe, MySQL, SQL/MP, SQL/MX | ✓ (2) | | |
VSAM CICS, VSAM Batch, IMS/DB, Adabas | | | ✓ |
  • 56. (2) Requires customization of Oracle GoldenGate configuration generated by JKM
Publish-and-Subscribe Model
The ODI journalizing framework uses a publish-and-subscribe model. This model works in three steps:
1. An identified subscriber, usually an integration process, subscribes to changes that might occur in a datastore. Multiple subscribers can subscribe to these changes.
  • 57. 2. The Changed Data Capture framework captures changes in the datastore and then publishes them for the subscriber.
3. The subscriber—an integration process—can process the tracked changes at any time and consume these events. Once consumed, events are no longer available for this subscriber.
ODI processes datastore changes in two ways:
- Regularly in batches (pull mode)—for example, processes new orders from the Web site every five minutes and loads them into the operational datastore (ODS)
- In real time (push mode) as the changes occur—for example, when a product is changed in the enterprise resource planning (ERP) system,
  • 58. immediately updates the on-line catalog.
Processing the Changes
ODI employs a powerful declarative design approach, Extract-Load, Transform (E-LT), which separates the rules from the implementation details. Its out-of-the-box integration interfaces use and process the tracked changes. Developers define the declarative rules for the captured changes within the integration processes in the Designer Navigator of the ODI Studio graphical user interface—without having to code. In ODI Studio, customers declaratively specify set-based maps between sources and targets, and then the system automatically generates the data flow from the
  • 59. set-based maps. (Figure 4: The ODI Journalizing Framework uses a publish-and-subscribe architecture.)
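The following simplified sketch (hypothetical table and column names, continuing the journal example above rather than showing ODI's internal implementation) illustrates the consume step of the publish-and-subscribe cycle depicted in Figure 4:

-- Read the pending change events published for one subscriber.
SELECT JRN_FLAG, JRN_DATE, ORDER_ID
FROM J$ORDERS
WHERE JRN_SUBSCRIBER = 'DW_LOAD'
ORDER BY JRN_DATE;

-- Once the integration process has applied these changes to its target,
-- the consumed events are no longer available for this subscriber.
DELETE FROM J$ORDERS
WHERE JRN_SUBSCRIBER = 'DW_LOAD'
  AND JRN_DATE <= :last_processed_date;   -- hypothetical bind variable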
  • 60. The technical processes required for processing the changes captured are implemented in ODI's Knowledge Modules. Knowledge Modules are scripted modules that contain database and application-specific patterns. The runtime then interprets these modules and optimizes the instructions for targets.
  • 61. Ensuring Data Consistency
Changes frequently involve several datastores at one time. For example, when an order is created, updated, or deleted, it involves both the orders table and the order lines table. When processing a new order line, the new order to which this line is related must be taken into account. ODI provides a mode of tracking changes, called Consistent Set Changed Data Capture, for this purpose. This mode allows you to process sets of changes that guarantee data consistency.
Best Practices using ODI for Real-Time Data Warehousing
As with other integration problems, there is no one-size-fits-all approach when it comes to Real-Time
  • 62. Data Warehousing. Much depends on the latency requirements, the overall data volume as well as the daily change volume, the load patterns on sources and targets, and the structure and query requirements of the data warehouse. As covered in this paper, ODI supports all approaches to loading a data warehouse. In practice there is one approach that satisfies the majority of real-time data warehousing use cases: the micro-batch approach using GoldenGate-based CDC with ODI. In this approach, one or more tables from operational databases are used as sources and are replicated by GoldenGate into a staging area database. This staging area provides a real-time copy of the transactional data for real-time reporting using BI
  • 63. tools and dashboards. The operational sources are not additionally stressed as GoldenGate capture is non-invasive and performant, and the separate staging area handles operational BI queries without adding load to the transactional system. ODI performs a load of the changed records to the real-time data warehouse in frequent periods of 15 minutes or more. This pattern has demonstrated the best combination of providing fresh, actionable data to the data warehouse without introducing inconsistencies in aggregates calculated in the data warehouse.
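To illustrate the periodic load step of this micro-batch pattern (a hedged sketch only; STG_SALES, DW_SALES_FACT and the change timestamp are hypothetical names and this is not ODI-generated code), the recurring job could merge the rows changed since the last run from the GoldenGate-maintained staging copy into the warehouse table:

-- Periodic (e.g. every 15 minutes) load from the staging area into the
-- real-time data warehouse.
MERGE INTO DW_SALES_FACT f
USING (SELECT SALE_ID, STORE_ID, SALE_DATE, AMOUNT
       FROM STG_SALES
       WHERE LAST_CHANGE_TS > :last_load_ts) s   -- only rows changed since the last run
ON (f.SALE_ID = s.SALE_ID)
WHEN MATCHED THEN
  UPDATE SET f.STORE_ID = s.STORE_ID, f.SALE_DATE = s.SALE_DATE, f.AMOUNT = s.AMOUNT
WHEN NOT MATCHED THEN
  INSERT (SALE_ID, STORE_ID, SALE_DATE, AMOUNT)
  VALUES (s.SALE_ID, s.STORE_ID, s.SALE_DATE, s.AMOUNT);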
  • 64. Customer Use Case: Overstock.com Overstock.com is using both GoldenGate and ODI to load customer transaction data into a real-time data warehouse. GoldenGate is used to capture changes from its retail site, while ODI is used for complex transformations in its data warehouse. Overstock.com has seen the benefits of leveraging customer data in real time. When the company sends out an e-mail campaign, rather than waiting one, two, or three days, it immediately sees whether consumers are clicking in the right place, if the e-mail is driving consumers to the site, and if those customers are making purchases. By accessing the data in real time using Oracle GoldenGate and transforming it using
  • 65. Oracle Data Integrator, Overstock.com can immediately track customer behavior, profitability of campaigns, and ROI. The retailer can now analyze customer behavior and purchase history to target marketing campaigns and service in real time and provide better service to its customers.
Conclusion
Integrating data and applications throughout the enterprise, and presenting a consolidated view of them, is a complex proposition. Not only are there broad disparities in data structures and application functionality, but there are also fundamental differences in integration architectures. Some integration needs are data oriented, especially those involving
  • 66. large data volumes. Other integration projects lend themselves to an event-oriented architecture for asynchronous or synchronous integration. Changes tracked by Changed Data Capture constitute data events. The ability to track these events and process them regularly in batches or in real time is key to the success of an event-driven integration architecture. Oracle Data Integrator provides rapid implementation and maintenance for all types of integration projects. (Figure 5: Micro-Batch Architecture using ODI and GoldenGate.)
  • 67. Best Practices for Real-time Data Warehousing
August 2012
  • 68. Author: Alex Kotopoulis, Oracle Corporation
  • 69. Copyright © 2010, Oracle and/or its affiliates. All rights reserved.