This document presents a framework for designing ETL (extraction-transformation-loading) scenarios. The framework includes a metamodel for defining ETL activities and their relationships. It employs a declarative language to define the semantics of each activity. The framework is generic but includes predefined "templates" for common ETL activities to promote reusability. The design concepts have been implemented in a tool called ARKTOS II, which is also presented.
Databricks CEO Ali Ghodsi introduces Databricks Delta, a new data management system that combines the scale and cost-efficiency of a data lake, the performance and reliability of a data warehouse, and the low latency of streaming.
Making Data Timelier and More Reliable with Lakehouse Technology (Matei Zaharia)
Enterprise data architectures usually contain many systems—data lakes, message queues, and data warehouses—that data must pass through before it can be analyzed. Each transfer step between systems adds a delay and a potential source of errors. What if we could remove all these steps? In recent years, cloud storage and new open source systems have enabled a radically new architecture: the lakehouse, an ACID transactional layer over cloud storage that can provide streaming, management features, indexing, and high-performance access similar to a data warehouse. Thousands of organizations including the largest Internet companies are now using lakehouses to replace separate data lake, warehouse and streaming systems and deliver high-quality data faster internally. I’ll discuss the key trends and recent advances in this area based on Delta Lake, the most widely used open source lakehouse platform, which was developed at Databricks.
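To make the lakehouse idea concrete: below is a minimal PySpark sketch of an ACID table on plain file storage, using the open source delta-spark package. The path, schema, and sample rows are illustrative assumptions, not details from the talk.

```python
# Minimal Delta Lake sketch: an ACID table on ordinary file storage.
# Assumes the open source delta-spark package (pip install delta-spark);
# the table path and columns are illustrative.
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder.appName("lakehouse-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Writes are atomic: readers see either the old or the new table version.
events = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "action"])
events.write.format("delta").mode("append").save("/tmp/events")

# Time travel: read an earlier committed version of the same table.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events")
v0.show()
```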
The document discusses strategies for managing master data through a Master Data Management (MDM) solution. It outlines challenges with current data management practices and goals for an improved MDM approach. Key considerations for implementing an effective MDM strategy include identifying initial data domains, use cases, source systems, consumers, and the appropriate MDM patterns to address business needs.
Doug Bateman, a principal data engineering instructor at Databricks, presented on how to build a Lakehouse architecture. He began by introducing himself and his background. He then laid out the session's goals: describing key Lakehouse features, explaining how Delta Lake enables a Lakehouse, and developing a sample Lakehouse using Databricks. The key aspects of a Lakehouse are that it supports diverse data types and workloads while enabling the use of BI tools directly on source data. Delta Lake provides reliability, consistency, and performance through its ACID transactions, automatic file consolidation, and integration with Spark. Bateman concluded with a demo of creating a Lakehouse.
Introduction to SQL Analytics on Lakehouse Architecture (Databricks)
This document provides an introduction and overview of SQL Analytics on Lakehouse Architecture. It discusses the instructor Doug Bateman's background and experience. The course goals are outlined as describing key features of a data Lakehouse, explaining how Delta Lake enables a Lakehouse architecture, and defining features of the Databricks SQL Analytics user interface. The course agenda is then presented, covering topics on Lakehouse Architecture, Delta Lake, and a Databricks SQL Analytics demo. Background is also provided on Lakehouse architecture, how it combines the benefits of data warehouses and data lakes, and its key features.
This document discusses metadata and the importance of metadata management. It introduces Apache Atlas as an open source platform for metadata management and governance. Key points include:
- Metadata is important for data reuse, analytics, and governance. It provides context and meaning about data.
- Current reality is that metadata is often not well supported or integrated across tools. Apache Atlas aims to provide an open, unified approach.
- Apache Atlas has graduated to a top-level Apache project. It provides a type-agnostic metadata store and interfaces that can be accessed by various tools.
- The vision is for an open ecosystem where metadata is shared and federated across repositories from different vendors and tools.
Architect’s Open-Source Guide for a Data Mesh Architecture (Databricks)
Data Mesh is an innovative concept addressing many data challenges from an architectural, cultural, and organizational perspective. But is the world ready to implement Data Mesh?
In this session, we will review the importance of core Data Mesh principles, what they can offer, and when it is a good idea to try a Data Mesh architecture. We will discuss common challenges with implementation of Data Mesh systems and focus on the role of open-source projects for it. Projects like Apache Spark can play a key part in standardized infrastructure platform implementation of Data Mesh. We will examine the landscape of useful data engineering open-source projects to utilize in several areas of a Data Mesh system in practice, along with an architectural example. We will touch on what work (culture, tools, mindset) needs to be done to ensure Data Mesh is more accessible for engineers in the industry.
The audience will leave with a good understanding of the benefits of Data Mesh architecture, common challenges, and the role of Apache Spark and other open-source projects for its implementation in real systems.
This session is targeted for architects, decision-makers, data-engineers, and system designers.
Azure Databricks for Data Engineering by Eugene Polonichko (Dimko Zhluktenko)
This document provides an overview of Azure Databricks, an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud. It discusses key components of Azure Databricks including clusters, workspaces, notebooks, visualizations, jobs, alerts, and the Databricks File System. It also outlines how data engineers can leverage Azure Databricks for scenarios like running ETL pipelines, streaming analytics, and connecting business intelligence tools to query data.
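As a rough illustration of the ETL scenario described above, here is a hedged PySpark sketch that reads raw files, cleans them, and writes a curated table. The mount points, columns, and layout are invented for the example; Databricks notebooks provide the `spark` session.

```python
# Sketch of an ETL job a data engineer might run on an Azure Databricks
# cluster. Paths, column names, and the mount point are illustrative
# assumptions. `spark` is predefined in Databricks notebooks.
from pyspark.sql import functions as F

# Extract: read raw CSV landed in mounted cloud storage.
raw = (spark.read
       .option("header", "true")
       .csv("/mnt/landing/sales/2021/*.csv"))

# Transform: type the columns, drop bad rows, add a load date.
clean = (raw
         .withColumn("amount", F.col("amount").cast("double"))
         .dropna(subset=["order_id", "amount"])
         .withColumn("load_date", F.current_date()))

# Load: write a partitioned table for downstream BI queries.
(clean.write
 .mode("overwrite")
 .partitionBy("load_date")
 .parquet("/mnt/curated/sales"))
```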
Data mesh is a decentralized approach to managing and accessing analytical data at scale. It distributes responsibility for data pipelines and quality to domain experts. The key principles are domain-centric ownership, treating data as a product, and using a common self-service infrastructure platform. Snowflake is well-suited for implementing a data mesh with its capabilities for sharing data and functions securely across accounts and clouds, with built-in governance and a data marketplace for discovery. A data mesh implemented on Snowflake's data cloud can support truly global and multi-cloud data sharing and management according to data mesh principles.
The document discusses Azure Data Factory v2. It provides an agenda that includes topics like triggers, control flow, and executing SSIS packages in ADFv2. It then introduces the speaker, Stefan Kirner, who has over 15 years of experience with Microsoft BI tools. The rest of the document consists of slides on ADFv2 topics like the pipeline model, triggers, activities, integration runtimes, scaling SSIS packages, and notes from the field on using SSIS packages in ADFv2.
Part IV covers the content framework, architecture repository and metamodel. Part V discusses the Enterprise Continuum and Tools Requirements. Part VI covers two example reference models, while Part VII looks at the Enterprise Architecture Capability Framework.
Activate Data Governance Using the Data Catalog (DATAVERSITY)
This document discusses activating data governance using a data catalog. It compares active vs passive data governance, with active embedding governance into people's work through a catalog. The catalog plays a key role by allowing stewards to document definition, production, and usage of data in a centralized place. For governance to be effective, metadata from various sources must be consolidated and maintained in the catalog.
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D... (Databricks)
Many have dubbed the 2020s the decade of data. This is indeed an era of data zeitgeist.
From code-centric software development 1.0, we are entering software development 2.0, a data-centric and data-driven approach in which data plays a central role in our everyday lives.
As the volume and variety of data garnered from myriad sources continue to grow at an astronomical scale, and as cloud computing offers cheap compute and storage at scale, data platforms have to match in their ability to process, analyze, and visualize data at scale, at speed, and with ease. This involves paradigm shifts in how data is processed and stored, and in the programming frameworks that give developers access to these platforms.
In this talk, we will survey some emerging technologies that address the challenges of data at scale, how these tools help data scientists and machine learning developers with their data tasks, why they scale, and how they facilitate the future data scientists to start quickly.
In particular, we will examine in detail two open-source tools MLflow (for machine learning life cycle development) and Delta Lake (for reliable storage for structured and unstructured data).
Other emerging tools, such as Koalas, help data scientists do exploratory data analysis at scale in a language and framework they already know; we will also touch on emerging data + AI trends for 2021.
You will understand the challenges of machine learning model development at scale, why you need reliable and scalable storage, and what other open source tools are at your disposal to do data science and machine learning at scale.
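A minimal MLflow tracking sketch of the life-cycle idea mentioned above; the experiment name, model, and metric values are illustrative, but the calls are the standard MLflow tracking API.

```python
# Log parameters, metrics, and a model for one training run with MLflow.
# Experiment name and hyperparameters are made up for the example.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=0)

mlflow.set_experiment("scale-demo")
with mlflow.start_run():
    model = Ridge(alpha=0.5).fit(X, y)
    mlflow.log_param("alpha", 0.5)
    mlflow.log_metric("r2", r2_score(y, model.predict(X)))
    mlflow.sklearn.log_model(model, "model")  # reproducible model artifact
```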
Building the Data Lake with Azure Data Factory and Data Lake Analytics (Khalid Salama)
In essence, a data lake is a commodity distributed file system that acts as a repository for raw data extracts from all the enterprise source systems, so that it can serve the data management and analytics needs of the business. A data lake system provides the means to ingest data, perform scalable big data processing, and serve information, in addition to managing, monitoring, and securing the environment. In these slides, we discuss building data lakes using Azure Data Factory and Data Lake Analytics. We delve into the architecture of the data lake and explore its various components. We also describe the various data ingestion scenarios and considerations. We introduce the Azure Data Lake Store, then discuss how to build an Azure Data Factory pipeline to ingest data into the lake. After that, we move into big data processing using Data Lake Analytics and delve into U-SQL.
Real Time Data Processing using Spark Streaming | Data Day Texas 2015 (Cloudera, Inc.)
Speaker: Hari Shreedharan
Data Day Texas 2015
Apache Spark has emerged over the past year as the heir apparent to Hadoop MapReduce. Spark can process data in memory at very high speed while still being able to spill to disk if required. Spark's powerful yet flexible API allows users to write complex applications easily, without worrying about the internal workings or how the data gets processed on the cluster.
Spark comes with an extremely powerful Streaming API to process data as it is ingested. Spark Streaming integrates with popular data ingest systems like Apache Flume, Apache Kafka, and Amazon Kinesis, allowing users to process data as it comes in.
In this talk, Hari will discuss the basics of Spark Streaming, its API and its integration with Flume, Kafka and Kinesis. Hari will also discuss a real-world example of a Spark Streaming application, and how code can be shared between a Spark application and a Spark Streaming application. Each stage of the application execution will be presented, which can help understand practices while writing such an application. Hari will finally discuss how to write a custom application and a custom receiver to receive data from other systems.
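The talk covers the original DStream-based API; as a hedged sketch, the same Kafka-ingest pattern looks like this with today's Structured Streaming API. Broker and topic names are placeholders, and the spark-sql-kafka connector must be on the classpath.

```python
# Continuous word/key counting over a Kafka topic with Structured
# Streaming. Requires the spark-sql-kafka connector; broker address
# and topic name are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-stream-demo").getOrCreate()

# Read records from Kafka as an unbounded streaming DataFrame.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

# Count events per key as new data arrives.
counts = (events
          .select(F.col("key").cast("string"))
          .groupBy("key")
          .count())

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```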
First introduced with the Analytics Platform System (APS), PolyBase simplifies management and querying of both relational and non-relational data using T-SQL. It is now available in both Azure SQL Data Warehouse and SQL Server 2016. The major features of PolyBase include the ability to do ad-hoc queries on Hadoop data and the ability to import data from Hadoop and Azure blob storage to SQL Server for persistent storage. A major part of the presentation will be a demo on querying and creating data on HDFS (using Azure Blobs). Come see why PolyBase is the “glue” to creating federated data warehouse solutions where you can query data as it sits instead of having to move it all to one data platform.
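As a hedged sketch of the PolyBase pattern described above, the snippet below issues the external-table DDL from Python via pyodbc. The data source, file format, table schema, and connection string are all illustrative placeholders; a real setup for private storage also needs a database-scoped credential.

```python
# Rough sketch of the PolyBase pattern: expose files in Azure Blob
# storage as an external T-SQL table, then query them in place.
# All names, locations, and credentials are placeholders.
import pyodbc

ddl = """
CREATE EXTERNAL DATA SOURCE AzureBlob
WITH (TYPE = HADOOP,
      LOCATION = 'wasbs://data@myaccount.blob.core.windows.net');

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

CREATE EXTERNAL TABLE dbo.WebLogs (
    url NVARCHAR(400),
    hits INT
)
WITH (LOCATION = '/logs/',
      DATA_SOURCE = AzureBlob,
      FILE_FORMAT = CsvFormat)
"""

conn = pyodbc.connect("DSN=sqldw;UID=user;PWD=secret", autocommit=True)
cur = conn.cursor()
for stmt in ddl.split(";"):            # run each DDL statement separately
    if stmt.strip():
        cur.execute(stmt)

# Query the files where they sit -- no import step required.
for row in cur.execute("SELECT TOP 5 url, hits FROM dbo.WebLogs"):
    print(row.url, row.hits)
```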
Feature Store as a Data Foundation for Machine Learning (Provectus)
This document discusses feature stores and their role in modern machine learning infrastructure. It begins with an introduction and agenda. It then covers challenges with modern data platforms and emerging architectural shifts towards things like data meshes and feature stores. The remainder discusses what a feature store is, reference architectures, and recommendations for adopting feature stores including leveraging existing AWS services for storage, catalog, query, and more.
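Since the document stays at the architecture level, here is a toy Python sketch of the core feature-store idea: compute features once, key them by entity and timestamp, and store them where training and serving can both read them. The layout and names are assumptions, not any particular feature-store product.

```python
# Toy "offline feature store": features computed in one place and shared
# by training and serving. Requires pandas and pyarrow (for parquet).
import pandas as pd

orders = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "amount":  [10.0, 25.0, 5.0, 7.5, 30.0],
})

# Feature computation, done once instead of per-model.
features = (orders.groupby("user_id")
            .agg(order_count=("amount", "size"),
                 avg_amount=("amount", "mean"))
            .reset_index())
features["as_of"] = pd.Timestamp.now(tz="UTC")

# The shared store: a versioned, queryable file both pipelines read.
features.to_parquet("user_features.parquet", index=False)

# Training or serving code later does a point lookup by entity key.
lookup = pd.read_parquet("user_features.parquet")
print(lookup[lookup.user_id == 2])
```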
How to identify the correct Master Data subject areas & tooling for your MDM... (Christopher Bradley)
1. What are the different Master Data Management (MDM) architectures?
2. How can you identify the correct Master Data subject areas & tooling for your MDM initiative?
3. A reference architecture for MDM.
4. Selection criteria for MDM tooling.
chris.bradley@dmadvisors.co.uk
As public institutions rapidly adopt the cloud, Naver Cloud Platform's cloud services are gaining traction in the market. This session surveys Naver Cloud Platform's services, explains why cloud technology is drawing such attention, and highlights which cloud technologies public institutions should focus on.
The document provides an overview of the Extract, Transform, Load (ETL) process. It defines ETL as extracting data from databases, transforming the format or cleaning the data, and loading it into a data warehouse or data mart. It contrasts ETL tools, which move data between databases, with business intelligence (BI) tools, which allow querying and visualization of data. Key aspects of ETL covered include source-to-target mapping, data validation and quality checks, and testing approaches. Challenges and best practices for ETL are also discussed.
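The steps above map directly onto code. Below is a minimal, self-contained Python sketch of extract, transform (with a data quality check), and load, using sqlite3 so it runs anywhere; real pipelines would swap in their own source and warehouse connections.

```python
# Minimal end-to-end ETL: extract from a source database, validate and
# transform, load into a warehouse table. sqlite3 keeps it runnable.
import sqlite3

src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, "10.50"), (2, "bad"), (3, "7.25")])

# Extract
rows = src.execute("SELECT id, amount FROM orders").fetchall()

# Transform: cast amounts, reject rows that fail the quality check.
clean, rejected = [], []
for oid, amount in rows:
    try:
        clean.append((oid, float(amount)))
    except ValueError:
        rejected.append((oid, amount))   # kept for reconciliation

# Load
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE fact_orders (id INTEGER, amount REAL)")
dw.executemany("INSERT INTO fact_orders VALUES (?, ?)", clean)
dw.commit()
print(f"loaded={len(clean)} rejected={len(rejected)}")
```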
Databricks is a Software-as-a-Service-like experience (or Spark-as-a-service): a tool for curating and processing massive amounts of data, for developing, training, and deploying models on that data, and for managing the whole workflow throughout the project. It is for those who are comfortable with Apache Spark, as it is 100% based on Spark and is extensible with support for Scala, Java, R, and Python alongside Spark SQL, GraphX, Streaming, and the Machine Learning Library (MLlib). It has built-in integration with many data sources, a workflow scheduler, real-time workspace collaboration, and performance improvements over traditional Apache Spark.
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture (Kai Wähner)
Apache Kafka in conjunction with Apache Spark has become the de facto standard for processing and analyzing data. Both frameworks are open, flexible, and scalable.
Unfortunately, operating them is a challenge for many teams. Ideally, teams could use serverless SaaS offerings to focus on business logic. However, hybrid and multi-cloud scenarios require a cloud-native platform that provides automated and elastic tooling to reduce the operations burden.
This session explores different architectures to build serverless Apache Kafka and Apache Spark multi-cloud architectures across regions and continents.
We start from the analytics perspective of a data lake and explore its relation to a fully integrated data streaming layer with Kafka to build a modern Data Lakehouse.
Real-world use cases show the joint value and explore the benefit of the "delta lake" integration.
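As a hedged sketch of that integration, the snippet below streams Kafka events continuously into a Delta Lake table. Broker, topic, and paths are placeholders, and it assumes a Spark session (`spark`) with the Kafka connector and delta-spark available.

```python
# Land Kafka events continuously into a Delta Lake table.
# Assumes `spark` exists and the Kafka + Delta connectors are installed.
from pyspark.sql import functions as F

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "orders")
          .load()
          .select(F.col("value").cast("string").alias("payload"),
                  F.col("timestamp")))

(stream.writeStream
 .format("delta")
 .option("checkpointLocation", "/tmp/chk/orders")  # exactly-once bookkeeping
 .outputMode("append")
 .start("/tmp/delta/orders"))
```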
Embarking on building a modern data warehouse in the cloud can be an overwhelming experience due to the sheer number of products that can be used, especially when the use cases for many products overlap one another. In this talk I will cover the use cases of many of the Microsoft products you can use when building a modern data warehouse, broken down into four areas: ingest, store, prep, and model & serve. It's a complicated story that I will try to simplify, giving blunt opinions on when to use which products and the pros and cons of each.
Databricks: A Tool That Empowers You To Do More With Data (Databricks)
In this talk we will present how Databricks has enabled the author to achieve more with data: one person can build a coherent data project with data engineering, analysis, and science components, with better collaboration, better productionalization methods, larger datasets, and faster turnaround.
The talk will include a demo illustrating how the multiple functionalities of Databricks combine into one coherent data project: Databricks jobs, Delta Lake, and Auto Loader for data engineering; SQL Analytics for data analysis; Spark ML and MLflow for data science; and Projects for collaboration.
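For reference, the Auto Loader pattern mentioned above looks roughly like the following on Databricks. The `cloudFiles` source is Databricks-specific, `spark` comes from the notebook session, and the paths are placeholders.

```python
# Incrementally ingest files as they arrive in cloud storage using
# Databricks Auto Loader, landing them in a bronze Delta table.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/tmp/schemas/raw_events")
      .load("/mnt/landing/events/"))

(df.writeStream
 .format("delta")
 .option("checkpointLocation", "/tmp/chk/raw_events")
 .trigger(availableNow=True)          # process the backlog, then stop
 .start("/mnt/bronze/events"))
```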
Good systems development often depends on multiple data management disciplines, and metadata is one of them. Much of the discussion around metadata focuses on metadata itself and its associated technologies, but this tool-and-technology focus has not achieved significant results. A more relevant question when considering pockets of metadata is whether to include them in the scope of organizational metadata practices. By understanding metadata practices, you can begin to build systems that allow you to exercise sophisticated data management techniques and support business initiatives.
Learning Objectives:
How to leverage metadata in support of your business strategy
Understanding foundational metadata concepts based on the DAMA DMBOK
Guiding principles & lessons learned
This session covers topics related to data archiving and sharing. This includes data formats, metadata, controlled vocabularies, preservation, archiving and repositories.
Applying DevOps to Databricks can be a daunting task. In this talk it will be broken down into bite-size chunks. Common DevOps subject areas will be covered, including CI/CD (Continuous Integration/Continuous Deployment), IaC (Infrastructure as Code), and build agents.
We will explore how to apply DevOps to Databricks (in Azure), primarily using Azure DevOps tooling. As a lot of Spark/Databricks users are Python users, we will focus on the Databricks REST API (using Python) to perform our tasks.
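A minimal sketch of that approach: a Python step in a release pipeline that talks to the Databricks REST API. The workspace URL and token are assumed to come from pipeline variables, and the job ID is a placeholder.

```python
# Drive Databricks from a DevOps pipeline via the REST API.
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://adb-....azuredatabricks.net
token = os.environ["DATABRICKS_TOKEN"]  # injected by the build agent
headers = {"Authorization": f"Bearer {token}"}

# List clusters in the workspace (a cheap connectivity smoke test).
resp = requests.get(f"{host}/api/2.0/clusters/list", headers=headers)
resp.raise_for_status()
for cluster in resp.json().get("clusters", []):
    print(cluster["cluster_id"], cluster["state"])

# Trigger a deployed job as a release step (job ID is a placeholder).
run = requests.post(f"{host}/api/2.0/jobs/run-now",
                    headers=headers, json={"job_id": 1234})
run.raise_for_status()
print("started run", run.json()["run_id"])
```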
Azure Data Factory is a cloud data integration service that allows users to create data-driven workflows (pipelines) comprised of activities to move and transform data. Pipelines contain a series of interconnected activities that perform data extraction, transformation, and loading. Data Factory connects to various data sources using linked services and can execute pipelines on a schedule or on-demand to move data between cloud and on-premises data stores and platforms.
Gathering Business Requirements for Data Warehouses (David Walker)
This document provides an overview of the process for gathering business requirements for a data management and warehousing project. It discusses why requirements are gathered, the types of requirements needed, how business processes create data in the form of dimensions and measures, and how the gathered requirements will be used to design reports to meet business needs. A straw-man proposal is presented as a starting point for further discussion.
The document discusses employee spend management and provides an overview of Concur Travel and Expense. It outlines challenges with managing employee spending such as lack of visibility, incomplete data, and noncompliance. Concur offers an integrated spend management solution that provides controls, reduces costs, increases adoption of processes, and gives visibility into spending data. The solution is hosted, requires minimal IT involvement, and integrates with other systems.
An Overview on Data Quality Issues at Data Staging ETL (idescitation)
A data warehouse (DW) is a collection of technologies aimed at enabling the decision maker to make better and faster decisions. Data warehouses differ from operational databases in that they are subject oriented, integrated, time variant, non volatile, summarized, larger, not normalized, and perform OLAP. The generic data warehouse architecture consists of three layers (data sources, DSA, and primary data warehouse). During the ETL process, data is extracted from OLTP databases, transformed to match the data warehouse schema, and loaded into the data warehouse database.
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us (Bertram Ludäscher)
PWE: Datalog & ASP for the Rest of Us discusses using Possible Worlds Explorer (PWE) to make Datalog and Answer Set Programming (ASP) more accessible to non-experts. It covers topics like using provenance to explain query results, capturing rule firings to track provenance, representing provenance as a graph, using states to track derivation rounds, and declarative profiling of Datalog programs. The presentation advocates for tools like PWE that wrap Datalog/ASP engines to combine them with Python ecosystems and allow interactive use in Jupyter notebooks. This makes the languages more approachable and helps users build on existing work by experimenting further.
Intelligent agents in ontology-based applications (infopapers)
The document describes the development of an intelligent agent using JADE that is linked to a knowledge base system implemented with Protege and Algernon. The agent delivers useful information to users from the web or other agents based on their preferences. The knowledge base contains ontologies defined in Protege and facts that can be queried using if-then rules in Algernon. The example application was developed in Java Studio Creator to demonstrate an intelligent information agent.
This document introduces object-oriented programming (OOP). It discusses the software crisis and need for new approaches like OOP. The key concepts of OOP like objects, classes, encapsulation, inheritance and polymorphism are explained. Benefits of OOP like reusability, extensibility and managing complexity are outlined. Real-time systems, simulation, databases and AI are examples of promising applications of OOP. The document was presented by Prof. Dipak R Raut at International Institute of Information Technology, Pune.
Chapter 1 - Introduction to System Integration and Architecture.pdf (Khairul Anwar Sedek)
The document discusses system integration and architecture. It begins with basic definitions of key terms like system integration, enterprise application integration, and system architecture. It then covers common integration approaches like vertical integration, horizontal integration, and enterprise service buses. The document also discusses system architecture models and lifecycles. It explains the roles and responsibilities of systems architects and necessary skills and education to work in the field.
The document summarizes the Zachman Framework, an alternative approach to organizing systems development proposed by John Zachman in 1987. The Zachman Framework organizes systems development around different perspectives (rows), including strategic planning, business owner, architect, designer, builder, and functioning system. It also addresses different aspects (columns), including data, functions, network, people, time, and motivation. This provides a more comprehensive view than traditional system development life cycles, which focus only on data and functions and view development as a linear process. The Zachman Framework emphasizes understanding each perspective and starting system development by accurately capturing the business owner's view of how the business operates.
Configuration management for Lyee software (mariase324)
This document provides an overview of object oriented analysis and design using the Unified Modeling Language (UML). It discusses key concepts in object oriented programming like classes, objects, encapsulation, inheritance and polymorphism. It also outlines the software development lifecycle and phases like requirements analysis, design, coding, testing and maintenance. Finally, it introduces UML and explains how use case diagrams can be used to model the user view of a system by defining actors and use cases.
AN AI PLANNING APPROACH FOR GENERATING BIG DATA WORKFLOWS (gerogepatton)
The scale of big data causes the compositions of extract-transform-load (ETL) workflows to grow increasingly complex. With the turnaround time for delivering solutions becoming a greater emphasis, stakeholders cannot continue to afford to wait the hundreds of hours it takes for domain experts to manually compose a workflow solution. This paper describes a novel AI planning approach that facilitates rapid composition and maintenance of ETL workflows. The workflow engine is evaluated on real-world scenarios from an industrial partner and results gathered from a prototype are reported to demonstrate the validity of the approach.
The document presents the "4+1" view model for describing software architectures. It consists of five views: the logical view, process view, physical view, development view, and use case scenarios. Each view addresses different stakeholder concerns and can be described using its own notation. The logical view describes the object-oriented decomposition. The process view addresses concurrency and distribution. The physical view maps software to hardware. The development view describes module organization. Together these views provide a comprehensive architecture description that addresses multiple stakeholder needs.
This document provides an overview of software architectures and architectural structures. It discusses different types of architectural structures, including module structures, component-and-connector structures, and allocation structures. Module structures focus on modules and their relationships, component-and-connector structures examine runtime components and connectors, and allocation structures show how software elements map to environments. The document then examines specific architectural structures like modules, layers, classes, processes, repositories, and deployment. It emphasizes that an architect should focus on a few key structures like logical, process, development, and physical views to validate that the architecture meets requirements.
Is The Architectures Of The Convnets For Action... — Sheila Guy
The document discusses enterprise architecture (EA), which is defined as a conceptual blueprint that defines the structure and operation of an organization. The purpose of EA is to understand how an organization can most effectively achieve its current and future objectives. Some key benefits of EA include taking a holistic approach, ensuring consistency when delivering solutions to business problems, and aligning business and IT strategies to effectively use IT assets and support the organization's goals. EA is similar to city planning in providing an overall framework and context.
The document defines various elements of function point analysis, including:
1. File Types Referenced (FTRs), Internal Logical Files (ILFs), External Interface Files (EIFs), External Inputs (EI), External Outputs (EO), External Inquiries (EQ), and General System Characteristics (GSCs), which are the main components measured in a function point analysis.
2. It provides descriptions of each component: FTRs are files referenced by transactions, ILFs and EIFs are files stored internally or externally, EI involves data entering the system, EO is data exiting it, and EQ retrieves data without updates.
3. GSCs consider other factors, like architecture and performance, that adjust the final function point count.
This document summarizes a lecture on dealing with large-scale web data using large-scale file systems and MapReduce. It introduces MapReduce basics like its programming model and word count example. It also discusses large-scale file systems like Google File System (GFS), which stores data in chunks across multiple servers and provides replication for reliability. GFS assumptions include commodity hardware, high component failure rates, and large streaming reads over random access.
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA — cscpconf
In this paper we investigate the problem of providing scalability to the near-real-time ETL+Q (extract, transform, load and querying) process of data warehouses. In general, data loading, transformation and integration are heavy tasks that are performed only periodically during small fixed time windows. We propose an approach to enable the automatic scalability and freshness of any data warehouse and ETL+Q process for near-real-time BigData scenarios. A general framework for testing the proposed system was implemented, supporting parallelization solutions for each part of the ETL+Q pipeline. The results show that the proposed system is capable of handling scalability to provide the desired processing speed.
ETL design document
Information Systems 30 (2005) 492–525
www.elsevier.com/locate/infosys

A generic and customizable framework for the design of ETL scenarios

Panos Vassiliadis (a), Alkis Simitsis (b), Panos Georgantas (b), Manolis Terrovitis (b), Spiros Skiadopoulos (b)

(a) Department of Computer Science, University of Ioannina, Ioannina, Greece
(b) Department of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece
Abstract

Extraction–transformation–loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization and insertion into a data warehouse. In this paper, we delve into the logical design of ETL scenarios and provide a generic and customizable framework in order to support the DW designer in his task. First, we present a metamodel particularly customized for the definition of ETL activities. We follow a workflow-like approach, where the output of a certain activity can either be stored persistently or passed to a subsequent activity. Also, we employ a declarative database programming language, LDL, to define the semantics of each activity. The metamodel is generic enough to capture any possible ETL activity. Nevertheless, in the pursuit of higher reusability and flexibility, we specialize the set of our generic metamodel constructs with a palette of frequently used ETL activities, which we call templates. Moreover, in order to achieve a uniform extensibility mechanism for this library of built-ins, we have to deal with specific language issues. Therefore, we also discuss the mechanics of template instantiation to concrete activities. The design concepts that we introduce have been implemented in a tool, ARKTOS II, which is also presented.
© 2004 Elsevier B.V. All rights reserved.

Keywords: Data warehousing; ETL
E-mail addresses: pvassil@cs.uoi.gr (P. Vassiliadis), asimi@dbnet.ece.ntua.gr (A. Simitsis), pgeor@dbnet.ece.ntua.gr (P. Georgantas), mter@dbnet.ece.ntua.gr (M. Terrovitis), spiros@dbnet.ece.ntua.gr (S. Skiadopoulos).
doi:10.1016/j.is.2004.11.002

1. Introduction

Data warehouse operational processes normally compose a labor-intensive workflow, involving data extraction, transformation, integration, cleaning and transport. To deal with this workflow, specialized tools are already available in the market [1–4], under the general title Extraction–Transformation–Loading (ETL) tools. To give a general idea of the functionality of these tools we mention their most prominent tasks, which include (a) the identification of relevant information at the source side, (b) the extraction of this information,
[Fig. 1 contrasts two perspectives for an ETL workflow. Logical perspective: the execution plan (execution sequence, execution schedule, recovery plan), the relationship with data (primary data flow, data flow for logical exceptions) and the administration plan (monitoring & logging, security & access rights management). Physical perspective: the resource layer and the operational layer.]
Fig. 1. Different perspectives for an ETL workflow.
(c) the customization and integration of the information coming from multiple sources into a common format, (d) the cleaning of the resulting data set on the basis of database and business rules, and (e) the propagation of the data to the data warehouse and/or data marts.

If we treat an ETL scenario as a composite workflow, in a traditional way, its designer is obliged to define several of its parameters (Fig. 1). Here, we follow a multi-perspective approach that enables us to separate these parameters and study them in a principled manner. We are mainly interested in the design and administration parts of the lifecycle of the overall ETL process, and we depict them at the upper and lower part of Fig. 1, respectively. At the top of Fig. 1, we are mainly concerned with the static design artifacts for a workflow environment. We will follow a traditional approach and group the design artifacts into logical and physical, with each category comprising its own perspective. We depict the logical perspective on the left-hand side of Fig. 1, and the physical perspective on the right-hand side. At the logical perspective, we classify the design artifacts that give an abstract description of the workflow environment. First, the designer is responsible for defining an execution plan for the scenario. The definition of an execution plan can be seen from various perspectives. The execution sequence involves the specification of which activity runs first, second, and so on, which activities run in parallel, or when a semaphore is defined so that several activities are synchronized at a rendezvous point. ETL activities normally run in batch, so the designer needs to specify an execution schedule, i.e., the time points or events that trigger the execution of the scenario as a whole. Finally, due to system crashes, it is imperative that there exists a recovery plan, specifying the sequence of steps to be taken in the case of failure for a certain activity (e.g., retry to execute the activity, or undo any intermediate results produced so far). On the right-hand side of Fig. 1, we can also see the physical perspective, involving the registration of the actual entities that exist in the real world. We will reuse the terminology of [5] for the physical perspective. The resource layer comprises the definition of roles (human or software) that are responsible for executing the activities of the workflow. The operational layer, at the same time, comprises the software modules that implement the design entities of the logical perspective in the real world.
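To make these perspectives concrete, here is a minimal sketch (in Python, with hypothetical names such as ExecutionPlan and RecoveryAction; the paper prescribes no such API) of how the design parameters of Fig. 1 could be recorded for a scenario like the one discussed later:

from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List

class RecoveryAction(Enum):
    RETRY = "retry"   # re-execute the failed activity
    UNDO = "undo"     # discard any intermediate results produced so far

@dataclass
class ExecutionPlan:
    sequence: List[str]                  # total order of activities
    schedule: str                        # time points/events triggering the run
    recovery: Dict[str, RecoveryAction]  # per-activity failure policy

@dataclass
class AdministrationPlan:
    monitoring: bool = True              # on-line notification of the administrator
    logging: bool = True                 # off-line notification
    access_rights: Dict[str, List[str]] = field(default_factory=dict)

plan = ExecutionPlan(
    sequence=["FTP_PS1", "Diff_PS1", "NotNull1", "Add_Attr1", "SK1"],
    schedule="nightly batch window",
    recovery={"SK1": RecoveryAction.UNDO},
)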
In other words, the activities defined at the logical layer (in an abstract way) are materialized and executed through the specific software modules of the physical perspective.

At the lower part of Fig. 1, we are dealing with the tasks that concern the administration of the workflow environment and their dynamic behavior at runtime. First, an administration plan should be specified, involving the notification of the administrator either on-line (monitoring) or off-line (logging) for the status of an executed activity, as well as the security and authentication management for the ETL environment.

We find that research has not dealt with the definition of data-centric workflows to the entirety of its extent. In the ETL case, for example, due to the data-centric nature of the process, the designer must deal with the relationship of the involved activities with the underlying data. This involves the definition of a primary data flow that describes the route of data from the sources towards their final destination in the data warehouse, as they pass through the activities of the scenario. Also, due to possible quality problems of the processed data, the designer is obliged to define a data flow for logical exceptions, i.e., a flow for the problematic data, that is, the rows that violate integrity or business rules. It is the combination of the execution sequence and the data flow that generates the semantics of the ETL workflow: the data flow defines what each activity does and the execution plan defines in which order and combination.

In this paper, we work on the internals of the data flow of ETL scenarios. First, we present a metamodel particularly customized for the definition of ETL activities. We follow a workflow-like approach, where the output of a certain activity can either be stored persistently or passed to a subsequent activity. Moreover, we employ a declarative database programming language, LDL, to define the semantics of each activity. The metamodel is generic enough to capture any possible ETL activity; nevertheless, reusability and ease-of-use dictate that we can do better in aiding the data warehouse designer in his task. In this pursuit of higher reusability and flexibility, we specialize the set of our generic metamodel constructs with a palette of frequently used ETL activities, which we call templates. Moreover, in order to achieve a uniform extensibility mechanism for this library of built-ins, we have to deal with specific language issues: thus, we also discuss the mechanics of template instantiation to concrete activities. The design concepts that we introduce have been implemented in a tool, ARKTOS II, which is also presented.

Our contributions can be listed as follows:
- First, we define a formal metamodel as an abstraction of ETL processes at the logical level. The data stores, activities and their constituent parts are formally defined. An activity is defined as an entity with possibly more than one input schema, an output schema and a parameter schema, so that the activity is populated each time with its proper parameter values. The flow of data from producers towards their consumers is achieved through the usage of provider relationships that map the attributes of the former to the respective attributes of the latter. A serializable combination of ETL activities, provider relationships and data stores constitutes an ETL scenario.
- Second, we provide a reusability framework that complements the genericity of the metamodel. Practically, this is achieved through a set of "built-in" specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios. This palette of template activities will be referred to as the template layer and it is characterized by its extensibility; in fact, due to language considerations, we provide the details of the mechanism that instantiates templates to specific activities.
- Finally, we discuss implementation issues and we present a graphical tool, ARKTOS II, that facilitates the design of ETL scenarios, based on our model.

This paper is organized as follows. In Section 2, we present a generic model of ETL activities. Section 3 describes the mechanism for specifying and materializing template definitions of frequently used ETL activities. Section 4 presents ARKTOS II, a prototype graphical tool. In Section 5, we survey related work. In Section 6, we make a
general discussion on the completeness and general applicability of our approach. Section 7 offers conclusions and presents topics for future research. Short versions of parts of this paper have been presented in [6,7].

2. Generic model of ETL activities

The purpose of this section is to present a formal logical model for the activities of an ETL environment. This model abstracts from the technicalities of monitoring, scheduling and logging while it concentrates on the flow of data from the sources towards the data warehouse through the composition of activities and data stores. The full layout of an ETL scenario, involving activities, recordsets and functions, can be modeled by a graph, which we call the architecture graph. We employ a uniform, graph-modeling framework for both the modeling of the internal structure of activities and the ETL scenario at large, which enables the treatment of the ETL environment from different viewpoints. First, the architecture graph comprises all the activities and data stores of a scenario, along with their components. Second, the architecture graph captures the data flow within the ETL environment. Finally, the information on the typing of the involved entities and the regulation of the execution of a scenario, through specific parameters, is also covered.

2.1. Graphical notation and motivating example

Being a graph, the architecture graph of an ETL scenario comprises nodes and edges. The involved data types, function types, constants, attributes, activities, recordsets, parameters and functions constitute the nodes of the graph. The different kinds of relationships among these entities are modeled as the edges of the graph. In Fig. 2, we give the graphical notation for all the modeling constructs that will be presented in the sequel.

Motivating example: To motivate our discussion, we will present an example involving the propagation of data from a certain source S1, towards a data warehouse DW through intermediate recordsets. These recordsets belong to a data staging area (DSA)(1) DS. The scenario involves the propagation of data from the table PARTSUPP of source S1 to the data warehouse DW. Table DW.PARTSUPP (PKEY, SOURCE, DATE, QTY, COST) stores information for the available quantity (QTY) and cost (COST) of parts (PKEY) per source (SOURCE). The data source S1.PARTSUPP (PKEY, DATE, QTY, COST) records the supplies from a specific geographical region, e.g., Europe. All the attributes, except for the dates, are instances of the Integer type. The scenario is graphically depicted in Fig. 3 and involves the following transformations.

1. First, we transfer via FTP_PS1 the snapshot from the source S1.PARTSUPP to the file DS.PS1_NEW of the DSA.(2)
2. In the DSA, we maintain locally a copy of the snapshot of the source as it was at the previous loading (we assume here the case of the incremental maintenance of the DW, instead of the case of the initial loading of the DW). The recordset DS.PS1_NEW (PKEY, DATE, QTY, COST) stands for the last transferred snapshot of S1.PARTSUPP. By detecting the difference of this snapshot with the respective version of the previous loading, DS.PS1_OLD (PKEY, DATE, QTY, COST), we can derive the newly inserted rows in S1.PARTSUPP. Note that the difference activity that we employ, namely Diff_PS1, checks for differences only on the primary key of the recordsets; thus, we ignore here any possible deletions or updates for the attributes COST, QTY of existing rows. Any row that is not newly inserted is rejected and so, it is propagated to Diff_PS1_REJ, which stores all the rejected rows. The schema of Diff_PS1_REJ is identical to the input schema of the activity Diff_PS1.

(1) In data warehousing terminology a DSA is an intermediate area of the data warehouse, specifically destined to enable the transformation, cleaning and integration of source data, before being loaded to the warehouse.
(2) The technical points, like FTP, are mostly employed to show what kind of problems someone has to deal with in a practical situation, rather than to relate this kind of physical operations to a logical model. In terms of logical modelling this is a simple passing of data from one site to another.
Graphical notation:
- Data types: black ellipsoids. Function types: black rectangles. Constants: black circles. Attributes: unshaded ellipsoids.
- RecordSets: cylinders. Functions: gray rectangles. Parameters: white rectangles. Activities: triangles.
- Provider relationships: bold solid arrows (from provider to consumer).
- Part-of relationships: simple lines with diamond edges.*
- Instance-of relationships: dotted arrows (from instance towards the type).
- Derived provider relationships: bold dotted arrows (from provider to consumer).
- Regulator relationships: dotted lines.
* We annotate the part-of relationship among a function and its return type with a directed edge, to distinguish it from the rest of the parameters.
Fig. 2. Graphical notation for the architecture graph.
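The legend translates directly into a graph data structure. The following sketch (Python; an illustration of ours, not the ARKTOS II representation) enumerates the node and edge kinds and stores the graph as an edge list:

from enum import Enum, auto

class NodeKind(Enum):
    DATA_TYPE = auto()
    FUNCTION_TYPE = auto()
    CONSTANT = auto()
    ATTRIBUTE = auto()
    ACTIVITY = auto()
    RECORDSET = auto()
    PARAMETER = auto()
    FUNCTION = auto()

class EdgeKind(Enum):
    PROVIDER = auto()          # bold solid arrows
    PART_OF = auto()           # simple lines with diamond edges
    INSTANCE_OF = auto()       # dotted arrows
    REGULATOR = auto()         # dotted lines
    DERIVED_PROVIDER = auto()  # bold dotted arrows

# edge list: (from-node, to-node, kind)
architecture_graph = [
    ("DS.PS1.PKEY", "SK1.IN.PKEY", EdgeKind.PROVIDER),
    ("SK1.IN.PKEY", "SK1", EdgeKind.PART_OF),
    ("DS.PS1.PKEY", "Integer", EdgeKind.INSTANCE_OF),
]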
[Fig. 3 shows the flow from Source to Data Warehouse through the DSA: S1.PARTSUPP → (FTP_PS1) → DS.PS1_NEW → (Diff_PS1, against DS.PS1_OLD; rejects to Diff_PS1_REJ) → (NotNull1 on COST; rejects to NotNull1_REJ) → (Add_Attr1, SOURCE = 1) → DS.PS1 → (SK1, via the LOOKUP table with PKEY, SOURCE, SKEY) → DW.PARTSUPP.]
Fig. 3. Bird's-eye view of the motivating example.
3. The rows that pass the activity Diff_PS1 are checked for null values of the attribute COST through the activity NotNull1. Rows having a NULL value for their COST are kept in the NotNull1_REJ recordset for further examination by the data warehouse administrator.
4. Although we consider the data flow for only one source, namely S1, the data warehouse can
clearly have more sources for part supplies. In order to keep track of the source of each row entering the DW, we need to add a 'flag' attribute, namely SOURCE, indicating the respective source. This is achieved through the activity Add_Attr1. We store the rows that stem from this process in the recordset DS.PS1 (PKEY, SOURCE, DATE, QTY, COST).
5. Next, we assign a surrogate key on PKEY. In the data warehouse context, it is a common tactic to replace the keys of the production systems with a uniform key, which we call a surrogate key [8]. The basic reasons for this replacement are performance and semantic homogeneity. Textual attributes are not the best candidates for indexed keys and thus, they need to be replaced by integer keys. At the same time, different production systems might use different keys for the same object, or the same key for different objects, resulting in the need for a global replacement of these values in the data warehouse. This replacement is performed through a lookup table of the form L (PRODKEY, SOURCE, SKEY). The SOURCE column is due to the fact that there can be synonyms in the different sources, which are mapped to different objects in the data warehouse. In our case, the activity that performs the surrogate key assignment for the attribute PKEY is SK1. It uses the lookup table LOOKUP (PKEY, SOURCE, SKEY). Finally, we populate the data warehouse with the output of the previous activity.

The role of rejected rows depends on the peculiarities of each ETL scenario. If the designer needs to administrate these rows further, then he/she should use intermediate storage recordsets, with the burden of an extra I/O cost. If the rejected rows should not have a special treatment, then the best solution is to ignore them; thus, in this case we avoid overloading the scenario with any extra storage recordset. In our case, we annotate only two of the presented activities with a destination for rejected rows. Out of these, while NotNull1_REJ absolutely makes sense as a placeholder for problematic rows having non-acceptable NULL values, Diff_PS1_REJ is presented for demonstration reasons only.

Finally, before proceeding, we would like to stress that we do not anticipate a manual construction of the graph by the designer; rather, we employ this section to clarify how the graph will look once constructed. To assist a more automatic construction of ETL scenarios, we have implemented the ARKTOS II tool that supports the designing process through a friendly GUI. We present ARKTOS II in Section 4.

2.2. Preliminaries

In this subsection, we will introduce the formal modeling of data types, data stores and functions, before proceeding to the modeling of ETL activities.

Elementary entities: We assume the existence of a countable set of data types. Each data type T is characterized by a name and a domain, i.e., a countable set of values, called dom(T). The values of the domains are also referred to as constants.

We also assume the existence of a countable set of attributes, which constitute the most elementary granules of the infrastructure of the information system. Attributes are characterized by their name and data type. The domain of an attribute is a subset of the domain of its data type. Attributes and constants are uniformly referred to as terms.

A schema is a finite list of attributes. Each entity that is characterized by one or more schemata will be called a structured entity. Moreover, we assume the existence of a special family of schemata, all under the general name of NULL schema, determined to act as placeholders for data which are not to be stored permanently in some data store. We refer to a family instead of a single NULL schema, due to a subtle technicality involving the number of attributes of such a schema (this will become clear in the sequel).

Recordsets: We define a record as the instantiation of a schema to a list of values belonging to the domains of the respective schema attributes. We can treat any data structure as a recordset provided that there are ways to logically
restructure it into a flat, typed record schema. Formally, a recordset is characterized by its name, its (logical) schema and its (physical) extension (i.e., a finite set of records under the recordset schema). If we consider a schema S = [A1, …, Ak] for a certain recordset, its extension is a mapping S = [A1, …, Ak] → dom(A1) × … × dom(Ak). Thus, the extension of the recordset is a finite subset of dom(A1) × … × dom(Ak), and a record is the instance of a mapping dom(A1) × … × dom(Ak) → [x1, …, xk], xi ∈ dom(Ai).

In the rest of this paper we will mainly deal with the two most popular types of recordsets, namely relational tables and record files. A database is a finite set of relational tables.

Functions: We assume the existence of a countable set of built-in system function types. A function type comprises a name, a finite list of parameter data types, and a single return data type. A function is an instance of a function type. Consequently, it is characterized by a name, a list of input parameters and a parameter for its return value. The data types of the parameters of the generating function type also define (a) the data types of the parameters of the function and (b) the legal candidates for the function parameters (i.e., attributes or constants of a suitable data type).

2.3. Activities

Activities are the backbone of the structure of any information system. We adopt the WfMC terminology [9] for processes/programs and we will call them activities in the sequel. An activity is an amount of "work which is processed by a combination of resource and computer applications" [9]. In our framework, activities are logical abstractions representing parts or full modules of code.

The execution of an activity is performed by a particular program. Normally, ETL activities will either be performed in a black-box manner by a dedicated tool, or they will be expressed in some language (e.g., PL/SQL, Perl, C). Still, we want to deal with the general case of ETL activities. We employ an abstraction of the source code of an activity, in the form of an LDL statement. Using LDL, we avoid dealing with the peculiarities of a particular programming language. Once again, we want to stress that the presented LDL description is intended to capture the semantics of each activity, instead of the way these activities are actually implemented.

An elementary activity is formally described by the following elements:
- Name: A unique identifier for the activity.
- Input schemata: A finite set of one or more input schemata that receive data from the data providers of the activity.
- Output schema: A schema that describes the placeholder for the rows that pass the check performed by the elementary activity.
- Rejections schema: A schema that describes the placeholder for the rows that do not pass the check performed by the activity, or whose values are not appropriate for the performed transformation.
- Parameter list: A set of pairs which act as regulators for the functionality of the activity (the target attribute of a foreign key check, for example). The first component of the pair is a name and the second is a schema, an attribute, a function or a constant.
- Output operational semantics: An LDL statement describing the content passed to the output of the operation, with respect to its input. This LDL statement defines (a) the operation performed on the rows that pass through the activity and (b) an implicit mapping between the attributes of the input schema(ta) and the respective attributes of the output schema.
- Rejection operational semantics: An LDL statement describing the rejected records, in a sense similar to the output operational semantics. This statement is by default considered to be the complement of the output operational semantics, except if explicitly defined differently.

There are two issues that we would like to elaborate on here:

NULL schemata: Whenever we do not specify a data consumer for the output or rejection schemata, the respective NULL schema
(involving the correct number of attributes) is implied. This practically means that the data targeted for this schema will neither be stored to some persistent data store, nor will they be propagated to another activity; they will simply be ignored.

Language issues: Initially, we used to specify the semantics of activities with SQL statements. Still, although clear and easy to write and understand, SQL is rather hard to use if one is to perform rewriting and composition of statements. Thus, we have supplemented SQL with LDL [10], a logic-programming, declarative language, as the basis of our scenario definition. LDL is a Datalog variant based on Horn-clause logic that supports recursion, complex objects and negation. In the context of its implementation in an actual deductive database management system, LDL++ [11], the language has been extended to support external functions, choice, aggregation (and even user-defined aggregation), updates and several other features.

2.4. Relationships in the architecture graph

In this subsection, we will elaborate on the different kinds of relationships that the entities of an ETL scenario have. Whereas these entities are modeled as the nodes of the architecture graph, relationships are modeled as its edges. Due to their diversity, before proceeding, we list these types of relationships along with the related terminology that we will use in this paper. The graphical notation of entities (nodes) and relationships (edges) is presented in Fig. 2.
- Part-of relationships. These relationships involve attributes and parameters and relate them to the respective activity, recordset or function to which they belong.
- Instance-of relationships. These relationships are defined among a data/function type and its instances.
- Provider relationships. These are relationships that involve attributes with a provider–consumer relationship.
- Regulator relationships. These relationships are defined among the parameters of activities and the terms that populate these activities.
- Derived provider relationships. A special case of provider relationships that occurs whenever output attributes are computed through the composition of input attributes and parameters. Derived provider relationships can be deduced from a simple rule and do not originally constitute a part of the graph.

In the rest of this subsection, we will detail the notions pertaining to the relationships of the architecture graph; the knowledgeable reader is referred to Section 2.5 where we discuss the issue of scenarios. We will base our discussions on a part of the scenario of the motivating example (presented in Section 2.1), including activity SK1.
[Fig. 4 depicts the schemata of DS.PS1, SK1 and DW.PARTSUPP; all attributes except DATE point to the data type Integer, while DATE points to the data type Date.]
Fig. 4. Instance-of relationships of the architecture graph.
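As a small illustration of how instance-of edges carry typing information (a Python sketch of ours; attribute and type names follow Fig. 4, the helper function is hypothetical):

# instance-of edges: attribute -> data type, as in Fig. 4
attribute_type = {
    "DS.PS1.PKEY": "Integer",
    "DS.PS1.QTY": "Integer",
    "DS.PS1.COST": "Integer",
    "DS.PS1.DATE": "Date",
    "SK1.OUT.SKEY": "Integer",
}

def same_type(a: str, b: str) -> bool:
    """Check used by the static constraint of Section 2.5: provider
    mappings must relate terms of the same data type."""
    return attribute_type[a] == attribute_type[b]

assert same_type("DS.PS1.PKEY", "SK1.OUT.SKEY")
assert not same_type("DS.PS1.DATE", "DS.PS1.QTY")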
Data types and instance-of relationships: To capture typing information on attributes and functions, the architecture graph comprises data and function types. Instantiation relationships are depicted as dotted arrows that stem from the instances and head toward the data/function types. In Fig. 4, we observe the attributes of the two activities of our example and their correspondence to two data types, namely integer and date. For reasons of presentation, we merge several instantiation edges so that the figure does not become too crowded.

Attributes and part-of relationships: The first thing to incorporate in the architecture graph is the structured entities (activities and recordsets) along with all the attributes of their schemata. We choose to avoid overloading the notation by incorporating the schemata per se; instead we apply a direct part-of relationship between an activity node and the respective attributes. We annotate each such relationship with the name of the schema (by default, we assume an IN, OUT, PAR, REJ tag to denote whether the attribute belongs to the input, output, parameter or rejection schema of the activity, respectively). Naturally, if the activity involves more than one input schemata, the relationship is tagged with an INi tag for the ith input schema. We also incorporate the functions along with their respective parameters and the part-of relationships among the former and the latter. We annotate the part-of relationship with the return type with a directed edge, to distinguish it from the rest of the parameters.

[Fig. 5 depicts the decomposition of DS.PS1, SK1 and DW.PARTSUPP into their attributes, with SK1's PAR schema (PKEY, SOURCE, LPKEY, LSOURCE, LSKEY) linked to the attributes of its input schema and of the lookup table LOOKUP (PKEY, SOURCE, SKEY).]
Fig. 5. Part-of, regulator and provider relationships of the architecture graph.
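The decomposition shown in Fig. 5 can be read as plain data. A hedged sketch (Python; the Activity record is our own illustration, not the paper's formalism) of SK1 with its tagged schemata (part-of edges) and its parameter bindings (regulator edges):

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Activity:
    name: str
    schemata: Dict[str, List[str]]  # tag (IN/OUT/PAR) -> attributes (part-of)
    regulators: Dict[str, str]      # parameter -> populating term

sk1 = Activity(
    name="SK1",
    schemata={
        "IN": ["PKEY", "DATE", "QTY", "COST", "SOURCE"],
        "OUT": ["PKEY", "DATE", "QTY", "COST", "SOURCE", "SKEY"],
        "PAR": ["PKEY", "SOURCE", "LPKEY", "LSOURCE", "LSKEY"],
    },
    regulators={
        "PKEY": "SK1.IN.PKEY",       # production key to be replaced
        "SOURCE": "SK1.IN.SOURCE",   # which source's data are processed
        "LPKEY": "LOOKUP.PKEY",      # lookup column with production keys
        "LSOURCE": "LOOKUP.SOURCE",  # lookup column with source codes
        "LSKEY": "LOOKUP.SKEY",      # lookup column with surrogate keys
    },
)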
Fig. 5 depicts a part of the motivating example. In terms of part-of relationships, we present the decomposition of (a) the recordsets DS.PS1, LOOKUP and DW.PARTSUPP and (b) the activity SK1 and the attributes of its input and output schemata. Note the tagging of the schemata of the involved activity. We do not consider the rejection schemata in order to avoid crowding the picture. Also note how the parameters of the activity are incorporated in the architecture graph. Activity SK1 has five parameters: (a) PKEY, which stands for the production key to be replaced, (b) SOURCE, which stands for an integer value that characterizes which source's data are processed, (c) LPKEY, which stands for the attribute of the lookup table which contains the production keys, (d) LSOURCE, which stands for the attribute of the lookup table which contains the source value (corresponding to the aforementioned SOURCE parameter), and (e) LSKEY, which stands for the attribute of the lookup table which contains the surrogate keys.

Parameters and regulator relationships: Once the part-of and instantiation relationships have been established, it is time to establish the regulator relationships of the scenario. In this case, we link the parameters of the activities to the terms (attributes or constants) that populate them. We depict regulator relationships with simple dotted edges.

In the example of Fig. 5 we can also observe how the parameters of activity SK1 are populated through regulator relationships. All the parameters of SK1, namely PKEY, SOURCE, LPKEY, LSOURCE and LSKEY, are mapped to the respective attributes of either the activity's input schema or the employed lookup table LOOKUP. The parameter LSKEY deserves particular attention. This parameter is (a) populated from the attribute SKEY of the lookup table and (b) used to populate the attribute SKEY of the output schema of the activity. Thus, two regulator relationships are related with parameter LSKEY, one for each of the aforementioned attributes. The existence of a regulator relationship among a parameter and an output attribute of an activity normally denotes that some external data provider is employed in order to derive a new attribute, through the respective parameter.

Provider relationships: The flow of data from the data sources towards the data warehouse is performed through the composition of activities in a larger scenario. In this context, the input for an activity can be either a persistent data store or another activity. Usually, this applies for the output of an activity, too. We capture the passing of data from providers to consumers by a provider relationship among the attributes of the involved schemata.

Formally, a provider relationship is defined by the following elements:
- Name: A unique identifier for the provider relationship.
- Mapping: An ordered pair. The first part of the pair is a term (i.e., an attribute or constant) acting as a provider, and the second part is an attribute acting as the consumer.

The mapping need not necessarily be 1:1 from provider to consumer attributes, since an input attribute can be mapped to more than one consumer attribute. Still, the opposite does not hold. Note that a consumer attribute can also be populated by a constant, in certain cases.

In order to achieve the flow of data from the providers of an activity towards its consumers, we need the following three groups of provider relationships:
1. A mapping between the input schemata of the activity and the output schema of their data providers. In other words, for each attribute of an input schema of an activity, there must exist an attribute of the data provider, or a constant, which is mapped to the former attribute.
2. A mapping between the attributes of the activity input schemata and the activity output (or rejection, respectively) schema.
3. A mapping between the output or rejection schema of the activity and the (input) schema of its data consumer.

The mappings of the second type are internal to the activity. Basically, they can be derived from the LDL statement for each of the output/rejection schemata. As far as the first and the third types of provider relationships are concerned, the mappings must be provided during the construction of the ETL scenario. This means that they are either (a) by default assumed by the order of the attributes of the involved schemata or (b) hard-coded by the user. Provider relationships are depicted with bold solid arrows that stem from the provider and end in the consumer attribute.
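For activity SK1 of the running example, the three groups could look as follows (a Python sketch; the dictionaries map consumer attributes to their providers, a representation of ours rather than the paper's notation):

attrs = ["PKEY", "DATE", "QTY", "COST", "SOURCE"]

# Group 1: data provider (DS.PS1) -> input schema of SK1
group1 = {f"SK1.IN.{a}": f"DS.PS1.{a}" for a in attrs}

# Group 2: input schema -> output schema (internal; derivable from the
# activity's LDL statement)
group2 = {f"SK1.OUT.{a}": f"SK1.IN.{a}" for a in attrs}

# Group 3: output schema -> data consumer (DW.PARTSUPP); note that the
# warehouse key is fed by the surrogate key, not the production key
group3 = {f"DW.PARTSUPP.{a}": f"SK1.OUT.{a}" for a in attrs}
group3["DW.PARTSUPP.PKEY"] = "SK1.OUT.SKEY"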
Observe Fig. 5. The flow starts from table DS.PS1 of the data staging area. Each of the attributes of this table is mapped to an attribute of the input schema of activity SK1. The attributes of the input schema of the latter are subsequently mapped to the attributes of the output schema of the activity. The flow continues to DW.PARTSUPP. Another interesting thing is that during the data flow, new attributes are generated, resulting in new streams of data, whereas the flow seems to stop for other attributes. Observe the rightmost part of Fig. 5 where the values of attribute PKEY are not further propagated (remember that the reason for the application of a surrogate key transformation is to replace the production keys of the source data with a homogeneous surrogate for the records of the data warehouse, which is independent of the source they have been collected from). Instead of the values of the production key, the values from the attribute SKEY will be used to denote the unique identifier for a part in the rest of the flow.

In Fig. 6, we depict the LDL definition of this part of the motivating example. The three rules correspond to the three categories of provider relationships previously discussed: the first rule explains how the data from the DS.PS1 recordset are fed into the input schema of the activity, the second rule explains the semantics of the activity (i.e., how the surrogate key is generated) and, finally, the third rule shows how the DW.PARTSUPP recordset is populated from the output schema of the activity SK1.

Derived provider relationships: As we have already mentioned, there are certain output attributes that are computed through the composition of input attributes and parameters. A derived provider relationship is another form of provider relationship that captures the flow from the input to the respective output attributes.

Formally, assume that (a) source is a term in the architecture graph, (b) target is an attribute of the output schema of an activity A and (c) x, y are parameters in the parameter list of A (not necessarily different). Then, a derived provider relationship pr(source, target) exists iff the following regulator relationships (i.e., edges) exist: rr1(source, x) and rr2(y, target).
addSkey_in1(A_IN1_PKEY,A_IN1_DATE,A_IN1_QTY,A_IN1_COST,A_IN1_SOURCE) <-
    ds_ps1(A_OUT_PKEY,A_OUT_DATE,A_OUT_QTY,A_OUT_COST,A_OUT_SOURCE),
    A_OUT_PKEY=A_IN1_PKEY,
    A_OUT_DATE=A_IN1_DATE,
    A_OUT_QTY=A_IN1_QTY,
    A_OUT_COST=A_IN1_COST,
    A_OUT_SOURCE=A_IN1_SOURCE.

addSkey_out(A_OUT_PKEY,A_OUT_DATE,A_OUT_QTY,A_OUT_COST,A_OUT_SOURCE,A_OUT_SKEY) <-
    addSkey_in1(A_IN1_PKEY,A_IN1_DATE,A_IN1_QTY,A_IN1_COST,A_IN1_SOURCE),
    lookup(A_IN1_SOURCE,A_IN1_PKEY,A_OUT_SKEY),
    A_OUT_PKEY=A_IN1_PKEY,
    A_OUT_DATE=A_IN1_DATE,
    A_OUT_QTY=A_IN1_QTY,
    A_OUT_COST=A_IN1_COST,
    A_OUT_SOURCE=A_IN1_SOURCE.

dw_partsupp(PKEY,DATE,QTY,COST,SOURCE) <-
    addSkey_out(A_OUT_PKEY,A_OUT_DATE,A_OUT_QTY,A_OUT_COST,A_OUT_SOURCE,A_OUT_SKEY),
    DATE=A_OUT_DATE,
    QTY=A_OUT_QTY,
    COST=A_OUT_COST,
    SOURCE=A_OUT_SOURCE,
    PKEY=A_OUT_SKEY.

NOTE: For reasons of readability we do not replace the A in attribute names with the activity name; i.e., A_OUT_PKEY should be diffPS1_OUT_PKEY.
Fig. 6. LDL specification of the motivating example.
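To see what the three rules compute, here is a rough Python paraphrase of Fig. 6 (our own rendering over lists of dicts; the actual semantics of the activity live in the LDL above):

def add_skey(ds_ps1, lookup):
    """lookup maps (SOURCE, PKEY) -> SKEY, like lookup/3 in Fig. 6."""
    # Rule 1: feed DS.PS1 into the activity's input schema
    a_in1 = [dict(row) for row in ds_ps1]
    # Rule 2: compute the surrogate key through the lookup table
    a_out = [dict(row, SKEY=lookup[(row["SOURCE"], row["PKEY"])])
             for row in a_in1]
    # Rule 3: populate DW.PARTSUPP, with SKEY taking PKEY's place
    return [{"PKEY": r["SKEY"], "DATE": r["DATE"], "QTY": r["QTY"],
             "COST": r["COST"], "SOURCE": r["SOURCE"]} for r in a_out]

rows = add_skey(
    [{"PKEY": 7, "DATE": 20041101, "QTY": 5, "COST": 10, "SOURCE": 1}],
    lookup={(1, 7): 1001},
)
assert rows == [{"PKEY": 1001, "DATE": 20041101, "QTY": 5,
                 "COST": 10, "SOURCE": 1}]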
[Fig. 7 shows activity SK1 with its IN, OUT and PAR schemata and the lookup table LOOKUP: on the left, only the original provider and regulator relationships; on the right, the five derived provider relationships from the attributes populating the parameters (SK1.IN.PKEY, SK1.IN.SOURCE, LOOKUP.PKEY, LOOKUP.SOURCE, LOOKUP.SKEY) towards the computed output attribute SKEY.]
Fig. 7. Derived provider relationships of the architecture graph: the original situation on the left and the derived provider relationships on the right.
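The deduction rule lends itself to a few lines of code. A sketch (Python; function and variable names are ours) that recovers the five derived edges of Fig. 7:

def derived_providers(regulators, params, output_attrs):
    """pr(source, target) iff rr(source, x) and rr(y, target) exist,
    with x, y parameters of the activity."""
    sources = [s for (s, x) in regulators
               if x in params and s not in output_attrs]
    targets = [t for (y, t) in regulators
               if y in params and t in output_attrs]
    return {(s, t) for s in sources for t in targets}

params = {"PKEY", "SOURCE", "LPKEY", "LSOURCE", "LSKEY"}
regulators = [
    ("SK1.IN.PKEY", "PKEY"), ("SK1.IN.SOURCE", "SOURCE"),
    ("LOOKUP.PKEY", "LPKEY"), ("LOOKUP.SOURCE", "LSOURCE"),
    ("LOOKUP.SKEY", "LSKEY"), ("LSKEY", "SK1.OUT.SKEY"),
]
edges = derived_providers(regulators, params, {"SK1.OUT.SKEY"})
assert len(edges) == 5  # every populating attribute flows into SKEY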
Intuitively, the case of derived relationships models the situation where the activity computes a new attribute in its output. In this case, the produced output depends on all the attributes that populate the parameters of the activity, resulting in the definition of the corresponding derived relationship.

Observe Fig. 7, where we depict a small part of our running example. The left side of the figure depicts the situation where only provider relationships exist. The legend in the right side of Fig. 7 depicts how we compute the derived provider relationships between the parameters of the activity and the computed output attribute SKEY. The meaning of these five relationships is that SK1.OUT.SKEY is not computed only from attribute LOOKUP.SKEY, but from the combination of all the attributes that populate the parameters.

One can also assume different variations of derived provider relationships such as (a) relationships that do not involve constants (remember that we have defined source as a term); (b) relationships involving only attributes of the same/different activity (as a measure of internal complexity or external dependencies); (c) relationships relating attributes that populate only the same parameter (e.g., only the attributes LOOKUP.SKEY and SK1.OUT.SKEY).

2.5. Scenarios

A scenario is an enumeration of activities along with their source/target recordsets and the respective provider relationships for each activity. An ETL scenario consists of the following elements:
- Name: A unique identifier for the scenario.
- Activities: A finite list of activities. Note that by employing a list (instead of, e.g., a set) of activities, we impose a total ordering on the execution of the scenario.
Built-in entities (model-specific / scenario-specific):
- Data Types: D^I / D
- Function Types: F^I / F
- Constants: C^I / C
User-provided entities (model-specific / scenario-specific):
- Attributes: Ω^I / Ω
- Functions: Φ^I / Φ
- Schemata: S^I / S
- RecordSets: RS^I / RS
- Activities: A^I / A
- Provider Relationships: Pr^I / Pr
- Part-Of Relationships: Po^I / Po
- Instance-Of Relationships: Io^I / Io
- Regulator Relationships: Rr^I / Rr
- Derived Provider Relationships: Dr^I / Dr
Fig. 8. Formal definition of domains and notation.
- Recordsets: A finite set of recordsets.
- Targets: A special-purpose subset of the recordsets of the scenario, which includes the final destinations of the overall process (i.e., the data warehouse tables that must be populated by the activities of the scenario).
- Provider relationships: A finite list of provider relationships among activities and recordsets of the scenario.

In our modeling, a scenario is a set of activities, deployed along a graph in an execution sequence that can be linearly serialized. For the moment, we do not consider the different alternatives for the ordering of the execution; we simply require that a total order for this execution is present (i.e., each activity has a discrete execution priority).

In terms of formal modeling of the architecture graph, we assume the infinitely countable, mutually disjoint sets of names (i.e., the values of which respect the unique name assumption) of column model-specific in Fig. 8. As far as a specific scenario is concerned, we assume their respective finite subsets, depicted in column scenario-specific in Fig. 8. Data types, function types and constants are considered built-ins of the system, whereas the rest of the entities are provided by the user (user-provided).

Formally, the architecture graph of an ETL scenario is a graph G(V,E) defined as follows:
V = D ∪ F ∪ C ∪ Ω ∪ Φ ∪ S ∪ RS ∪ A
E = Pr ∪ Po ∪ Io ∪ Rr ∪ Dr

In the sequel, we treat the terms architecture graph and scenario interchangeably. The reasoning for the term 'architecture graph' goes all the way down to the fundamentals of conceptual modeling. As mentioned in [12], conceptual models are the means by which designers conceive, architect, design, and build software systems. These conceptual models are used in the same way that blueprints are used in other engineering disciplines during the early stages of the lifecycle of artificial systems, which involves the creation of their architecture. The term 'architecture graph' expresses the fact that the graph that we employ for the modeling of the data flow of the ETL scenario is practically acting as a blueprint of the architecture of this software artifact.

Moreover, we assume the following integrity constraints for a scenario:
Static constraints:
- All the weak entities of a scenario (i.e., attributes or parameters) should be defined within a part-of relationship (i.e., they should have a container object).
- All the mappings in provider relationships should be defined among terms (i.e., attributes or constants) of the same data type.
Data flow constraints:
- All the attributes of the input schema(ta) of an activity should have a provider.
- Resulting from the previous requirement, if some attribute is a parameter in an activity A, the container of the attribute (i.e., recordset or activity) should precede A in the scenario.
- All the attributes of the schemata of the target recordsets should have a data provider.

Summarizing, in this section, we have presented a generic model for the data flow of ETL workflows. In the next section, we will proceed to detail how this generic model can be accompanied by a customization mechanism, in order to provide higher flexibility to the designer of the workflow.

3. Templates for ETL activities

In this section, we present the mechanism for exploiting template definitions of frequently used ETL activities. The general framework for the exploitation of these templates is accompanied by the presentation of the language-related issues for template management and appropriate examples.

3.1. General framework

Our philosophy during the construction of our metamodel was based on two pillars: (a) genericity, i.e., the derivation of a simple model, powerful enough to capture ideally all the cases of ETL activities, and (b) extensibility, i.e., the possibility of extending the built-in functionality of the system with new, user-specific templates.

The genericity doctrine was pursued through the definition of a rather simple activity metamodel, as described in Section 2. Still, providing a single metaclass for all the possible activities of an ETL environment is not really enough for the designer of the overall process. A richer "language" should be available, in order to describe the structure of the process and facilitate its construction. To this end, we provide a palette of template activities, which are specializations of the generic metamodel class.

Observe Fig. 9 for a further explanation of our framework. The lower layer of Fig. 9, namely the schema layer, involves a specific ETL scenario. All the entities of the schema layer are instances of the classes Data Type, Function Type, Elementary Activity, RecordSet and Relationship.
[Fig. 9 shows three layers: the metamodel layer (Data Types, Functions, Elementary Activity, RecordSet, Relationships); the template layer (e.g., NotNull, Domain Mismatch, SK Assignment, Source Table, Fact Table, Provider Relationship), connected to the metamodel layer through IsA links; and the schema layer (S1.PARTSUPP, NN, DM1, SK1, DW.PARTSUPP), connected to the upper layers through InstanceOf links.]
Fig. 9. The metamodel for the logical entities of the ETL environment.
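In code, the two linkages of Fig. 9 map naturally onto subclassing (IsA) and object instantiation (InstanceOf). A hedged Python sketch of ours, with class names taken from the figure:

class ElementaryActivity:  # metamodel layer
    def __init__(self, name: str):
        self.name = name

# template layer: specializations (IsA) of the metamodel class
class NotNull(ElementaryActivity): pass
class DomainMismatch(ElementaryActivity): pass
class SKAssignment(ElementaryActivity): pass

# schema layer: a concrete scenario (InstanceOf links)
nn = NotNull("NN")
dm1 = DomainMismatch("DM1")
sk1 = SKAssignment("SK1")

# an instance of a template class is also an instance of the generic
# metamodel class: exactly the double instantiation described in the text
assert isinstance(sk1, SKAssignment) and isinstance(sk1, ElementaryActivity)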
Thus, as one can see on the upper part of Fig. 9, we introduce a meta-class layer, namely the metamodel layer, involving the aforementioned classes. The linkage between the metamodel and the schema layers is achieved through instantiation (InstanceOf) relationships. The metamodel layer implements the aforementioned genericity desideratum: the classes which are involved in the metamodel layer are generic enough to model any ETL scenario, through the appropriate instantiation.

Still, we can do better than the simple provision of a metalayer and an instance layer. In order to make our metamodel truly useful for practical cases of ETL activities, we enrich it with a set of ETL-specific constructs, which constitute a subset of the larger metamodel layer, namely the template layer. The constructs in the template layer are also meta-classes, but they are quite customized for the regular cases of ETL activities. Thus, the classes of the template layer are specializations (i.e., subclasses) of the generic classes of the metamodel layer (depicted as IsA relationships in Fig. 9). Through this customization mechanism, the designer can pick the instances of the schema layer from a much richer palette of constructs; in this setting, the entities of the schema layer are instantiations, not only of the respective classes of the metamodel layer, but also of their subclasses in the template layer.

In the example of Fig. 9 the concept DW.PARTSUPP must be populated from a certain source S1.PARTSUPP. Several operations must intervene during the propagation. For instance, in Fig. 9, we check for null values and domain violations, and we assign a surrogate key. As one can observe, the recordsets that take part in this scenario are instances of class RecordSet (belonging to the metamodel layer) and specifically of its subclasses Source Table and Fact Table. Instances and encompassing classes are related through links of type InstanceOf. The same mechanism applies to all the activities of the scenario, which are (a) instances of class Elementary Activity and (b) instances of one of its subclasses, depicted in Fig. 9. Relationships do not escape this rule either. For instance, observe how the provider links from the concept S1.PS toward the concept DW.PARTSUPP are related to class Provider Relationship through the appropriate InstanceOf links.

As far as the class RecordSet is concerned, in the template layer we can specialize it to several subclasses, based on orthogonal characteristics, such as whether it is a file or RDBMS table, or whether it is a source or target data store (as in Fig. 9). In the case of the class Relationship, there is a clear specialization in terms of the five classes of relationships which have already been mentioned in Section 2 (i.e., Provider, Part-Of, Instance-Of, Regulator and Derived Provider).
Filters:
- Selection (σ)
- Not null (NN)
- Primary key violation (PK)
- Foreign key violation (FK)
- Unique value (UN)
- Domain mismatch (DM)
Unary operations:
- Push
- Aggregation (γ)
- Projection (Π)
- Function application (f)
- Surrogate key assignment (SK)
- Tuple normalization (N)
- Tuple denormalization (DN)
Binary operations:
- Union (U)
- Join (⋈)
- Diff (Δ)
- Update detection (Δ_UPD)
File operations:
- EBCDIC to ASCII conversion (EB2AS)
- Sort file (Sort)
Transfer operations:
- Ftp (FTP)
- Compress/Decompress (Z/dZ)
- Encrypt/Decrypt (Cr/dCr)
Fig. 10. Template activities, along with their graphical notation symbols, grouped by category.
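Read as data, the palette of Fig. 10 is simply a registry of template names grouped by category (a Python sketch; the identifiers are ours):

TEMPLATES = {
    "filters": ["selection", "not_null", "primary_key_violation",
                "foreign_key_violation", "unique_value", "domain_mismatch"],
    "unary_operations": ["push", "aggregation", "projection",
                         "function_application", "surrogate_key_assignment",
                         "tuple_normalization", "tuple_denormalization"],
    "binary_operations": ["union", "join", "diff", "update_detection"],
    "file_operations": ["ebcdic_to_ascii", "sort_file"],
    "transfer_operations": ["ftp", "compress", "decompress",
                            "encrypt", "decrypt"],
}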
Following the same framework, class Elementary Activity is further specialized to an extensible set of recurring patterns of ETL activities, depicted in Fig. 10. As one can see on the top side of Fig. 9, we group the template activities in five major logical groups. We do not depict the grouping of activities in subclasses in Fig. 9, in order to avoid overloading the figure; instead, we depict the specialization of class Elementary Activity to three of its subclasses whose instances appear in the employed scenario of the schema layer. We now proceed to present each of the aforementioned groups in more detail.

The first group, named filters, provides checks for the satisfaction (or not) of a certain condition. The semantics of these filters are the obvious ones (starting from a generic selection condition and proceeding to the check for null values, primary or foreign key violations, etc.). The second group of template activities is called unary operations and, except for the most generic push activity (which simply propagates data from the provider to the consumer), consists of the classical aggregation and function application operations, along with three data warehouse-specific transformations (surrogate key assignment, normalization and denormalization). The third group consists of classical binary operations, such as union, join and difference of recordsets/activities, as well as a special case of difference involving the detection of updates. Apart from the aforementioned template activities, which mainly refer to logical transformations, we can also consider the case of physical operators that refer to the application of physical transformations to whole files/tables. In the ETL context, we are mainly interested in transfer operations (ftp, compress/decompress, encrypt/decrypt) and file operations (EBCDIC to ASCII conversion, sort file).

Summarizing, the metamodel layer is a set of generic entities, able to represent any ETL scenario. At the same time, the genericity of the metamodel layer is complemented by the extensibility of the template layer, which is a set of "built-in" specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios. Moreover, apart from this "built-in", ETL-specific extension of the generic metamodel, if the designer decides that several patterns not included in the palette of the template layer occur repeatedly in his data warehousing projects, he can easily fit them into the customizable template layer through a specialization mechanism.

3.2. Formal definition and usage of template activities

Once the template layer has been introduced, the obvious issue that arises is its linkage with the employed declarative language of our framework. In general, the broader issue is the usage of the template mechanism by the user; to this end, we will explain the substitution mechanism for templates in this subsection and refer the interested reader to [13] for a presentation of the specific templates that we have constructed.

A template activity is formally defined by the following elements:

- Name: A unique identifier for the template activity.
- Parameter list: A set of names which act as regulators in the expression of the semantics of the template activity. For example, the parameters are used to assign values to constants, create dynamic mappings at instantiation time, etc.
- Expression: A declarative statement describing the operation performed by the instances of the template activity. As with elementary activities, our model supports LDL as the formalism for the expression of this statement.
- Mapping: A set of bindings, mapping input to output attributes, possibly through intermediate placeholders. In general, mappings at the template level try to capture a default way of propagating incoming values from the input towards the output schema. These default bindings are easily refined and possibly rearranged at instantiation time.
The template mechanism we use is a substitution mechanism, based on macros, that facilitates the
automatic creation of LDL code. This simple notation and instantiation mechanism permits the easy and fast registration of LDL templates. In the rest of this section, we will elaborate on the notation, the instantiation mechanism and the particularities of the template taxonomy.

3.2.1. Notation
Our template notation is a simple language featuring five main mechanisms for the dynamic production of LDL expressions: (a) variables that are replaced by their values at instantiation time; (b) a function that returns the arity of an input, output or parameter schema; (c) loops, where the loop body is repeated at instantiation time as many times as the iterator constraint defines; (d) keywords to simplify the creation of unique predicate and attribute names; and, finally, (e) macros which are used as syntactic sugar to simplify the way we handle complex expressions (especially in the case of variable-size schemata).

Variables: We have two kinds of variables in the template mechanism: parameter variables and loop iterators. Parameter variables are marked with a @ symbol at their beginning and are replaced by user-defined values at instantiation time. A list of parameters of arbitrary length is denoted by @<parameter name>[ ]. For such lists, the user has to explicitly or implicitly provide their length at instantiation time. Loop iterators, on the other hand, are implicitly defined in the loop constraint. During each loop iteration, all the properly marked appearances of the iterator in the loop body are replaced by its current value (similarly to the way the C preprocessor treats #DEFINE statements). Iterators that appear marked in the loop body are instantiated even when they are part of another string or of a variable name. We mark such appearances by enclosing them with $. This functionality enables referencing all the values of a parameter list and facilitates the creation of an arbitrary number of pre-formatted strings.
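For instance, in the function application template discussed later (Fig. 12), the parameter variable @FUNCTION holds the name of the function and the parameter list @PARAM[ ] holds its inputs. A minimal sketch of their replacement at instantiation time (the template fragment and the attribute names are assumed here only for illustration; the full template appears in Fig. 12) is:

   template fragment:   @FUNCTION(@PARAM[1], @PARAM[2], OUTFIELD)
   after instantiation: subtract(COST_IN, PRICE_IN, PROFIT)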
Functions: We employ a built-in function, arityOf(<input/output/parameter schema>), which returns the arity of the respective schema, mainly in order to define upper bounds in loop iterators.

Loops: Loops are a powerful mechanism that enhances the genericity of the templates by allowing the designer to handle templates with an unknown number of variables and an unknown arity for the input/output schemata. The general form of loops is

   [<simple constraint>] {<loop body>}

where simple constraint has the form

   <lower bound> <comparison operator> <iterator> <comparison operator> <upper bound>

We consider only a linear increase with step equal to 1, since this covers most possible cases. The upper and lower bounds can be arithmetic expressions involving arityOf() function calls, variables and constants. Valid arithmetic operators are +, -, /, * and valid comparison operators are <, >, =, all with their usual semantics. If the lower bound is omitted, 1 is assumed. During each loop iteration the loop body is reproduced and, at the same time, all the marked appearances of the loop iterator are replaced by its current value, as described before. Loop nesting is permitted.
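To see the loop mechanism at work, consider the expansion of a loop over the attributes of an input schema. Assuming that arityOf(a_in1) evaluates to 3 at instantiation time (a value chosen here only for illustration), the expression

   [i<arityOf(a_in1)]{A_IN1_$i$,} [i=arityOf(a_in1)]{A_IN1_$i$}

produces the attribute list

   A_IN1_1,A_IN1_2,A_IN1_3

since the first loop emits the comma-terminated attributes for i=1,2 and the second emits the last attribute without a trailing comma.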
Keywords: Keywords are used in order to refer to input and output schemata. They provide two main functionalities: (a) they simplify the reference to the input/output schema by using standard names for the predicates and their attributes, and (b) they allow their renaming at instantiation time. This is done in such a way that no different predicates with the same name will appear in the same program, and no different attributes with the same name will appear in the same rule. Keywords are recognized even if they are parts of another string, without any special notation. This facilitates a homogeneous renaming of multiple distinct input schemata at the template level to multiple distinct schemata at instantiation time, with all of them having unique names in the LDL program scope. For example, if the template is expressed in terms of two different input schemata a_in1 and a_in2, at instantiation time they will be renamed to
dm1_in1 and dm1_in2, so that the produced names will be unique throughout the scenario program. In Fig. 11, we depict the way the renaming is performed at instantiation time.

Keyword: a_out / a_in
Usage: a unique name for the output/input schema of the activity. The predicate produced when this template is instantiated has the form unique_pred_name_out (or _in, respectively).
Example: difference3_out / difference3_in

Keyword: A_OUT / A_IN
Usage: used for constructing the names of the a_out/a_in attributes. The names produced have the form predicate unique name in upper case_OUT (or _IN, respectively).
Example: DIFFERENCE3_OUT / DIFFERENCE3_IN

Fig. 11. Keywords for templates.
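As a sketch of the renaming (a single-attribute, two-input rule is assumed here only for illustration; the actual difference template involves more conjuncts), a template rule of the form

   a_out(A_OUT_1) <- a_in1(A_IN1_1), a_in2(A_IN2_1)

would be renamed, for an activity instance named difference3, to

   difference3_out(DIFFERENCE3_OUT_1) <-
      difference3_in1(DIFFERENCE3_IN1_1),
      difference3_in2(DIFFERENCE3_IN2_1)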
Macros: To make the definition of templates easier and to improve their readability, we introduce a macro mechanism to facilitate attribute and variable name expansion. For example, one of the major problems in defining a language for templates is the difficulty of dealing with schemata of arbitrary arity. Clearly, at the template level, it is not possible to pin down the number of attributes of the involved schemata to a specific value. For example, in order to create a series of names like the following

   name_theme_1, name_theme_2, ..., name_theme_k

we need to give the following expression:

   [iterator<maxLimit]{name_theme$iterator$}
   [iterator=maxLimit]{name_theme$iterator$}

Obviously, this makes the writing of templates hard and reduces their readability. To attack this problem, we resort to a simple reusable macro mechanism that enables the simplification of the employed expressions. For example, observe the definition of a template for a simple relational selection:

   a_out([i<arityOf(a_out)]{A_OUT_$i$,}
         [i=arityOf(a_out)]{A_OUT_$i$}) <-
      a_in1([i<arityOf(a_in1)]{A_IN1_$i$,}
            [i=arityOf(a_in1)]{A_IN1_$i$}),
      expr([i<arityOf(@PARAM)]{@PARAM[$i$],}
           [i=arityOf(@PARAM)]{@PARAM[$i$]}),
      [i<arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$,}
      [i=arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$}

As already mentioned in the syntax for loops, the expression

   [i<arityOf(a_out)]{A_OUT_$i$,}
   [i=arityOf(a_out)]{A_OUT_$i$}

defining the attributes of the output schema a_out simply lists a variable number of attributes that will be fixed at instantiation time. Exactly the same tactics apply to the attributes of the predicates a_in1 and expr. Also, the final two lines state that each attribute of the output will be equal to the respective attribute of the input (so that the query is safe), e.g., A_OUT_4 = A_IN1_4. We can simplify the definition of the template by allowing the designer
to define certain macros that simplify the management of variable-length attribute lists. We employ the following macros:

   DEFINE INPUT_SCHEMA AS
      [i<arityOf(a_in1)]{A_IN1_$i$,}
      [i=arityOf(a_in1)]{A_IN1_$i$}

   DEFINE OUTPUT_SCHEMA AS
      [i<arityOf(a_out)]{A_OUT_$i$,}
      [i=arityOf(a_out)]{A_OUT_$i$}

   DEFINE PARAM_SCHEMA AS
      [i<arityOf(@PARAM)]{@PARAM[$i$],}
      [i=arityOf(@PARAM)]{@PARAM[$i$]}

   DEFINE DEFAULT_MAPPING AS
      [i<arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$,}
      [i=arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$}

Then, the template definition is as follows:

   a_out(OUTPUT_SCHEMA) <-
      a_in1(INPUT_SCHEMA),
      expr(PARAM_SCHEMA),
      DEFAULT_MAPPING
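To make the combined effect of the macros and the loops concrete, assume at instantiation time that a_out, a_in1 and @PARAM all have arity 2 (a value assumed here purely for illustration). Macro expansion followed by loop production then yields the loop-free intermediate form:

   a_out(A_OUT_1, A_OUT_2) <-
      a_in1(A_IN1_1, A_IN1_2),
      expr(@PARAM[1], @PARAM[2]),
      A_OUT_1=A_IN1_1,
      A_OUT_2=A_IN1_2

after which the @PARAM[ ] elements are replaced by their user-supplied values and the keywords a_out and a_in1 are renamed to unique predicate names.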
3.2.2. Instantiation
Template instantiation is the process by which the user chooses a certain template and creates a concrete activity out of it. This procedure requires that the user specifies the schemata of the activity and gives concrete values to the template parameters. Then, the process of producing the respective LDL description of the activity is easily automated. Instantiation order is important in our template creation mechanism since, as can easily be seen from the notation definitions, different orders can lead to different results. The instantiation order is as follows:

1. Macro definitions are replaced by their expansions.
2. arityOf() functions and parameter variables appearing in loop boundaries are calculated first.
3. Loop productions are performed by instantiating the appearances of the iterators. This leads to intermediate results without any loops.
4. All the remaining parameter variables are instantiated.
5. Keywords are recognized and renamed.

We briefly explain the intuition behind this execution order. Macros are expanded first. Step (2) precedes step (3) because loop boundaries have to be calculated before loop productions are performed. Loops, on the other hand, have to be expanded before parameter variables are instantiated, if we want to be able to reference lists of variables. The only exception concerns the parameter variables that appear in the loop boundaries, which have to be calculated first. Notice, though, that variable list elements cannot appear in the loop constraint. Finally, we have to instantiate variables before keywords, since variables are used to create a dynamic mapping between the input/output schemata and other attributes.

Fig. 12 shows a simple example of template instantiation for the function application activity. To understand the overall process better, first observe its outcome, i.e., the specific activity which is produced, as depicted in the final row of Fig. 12, labeled keyword renaming. The output schema of the activity, fa12_out, is the head of the LDL rule that specifies the activity. The body of the rule says that the output records are specified by the conjunction of the following clauses: (a) the input schema myFunc_in, (b) the application of the function subtract over the attributes COST_IN, PRICE_IN and the production of a value PROFIT, and (c) the mapping of the input to the respective output attributes, as specified in the last three conjuncts of the rule.
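Putting these clauses together, a plausible sketch of the final rule of Fig. 12 is the following (the output attribute names COST_OUT, PRICE_OUT and PROFIT_OUT are assumed here only for illustration; the exact names in the figure may differ):

   fa12_out(COST_OUT, PRICE_OUT, PROFIT_OUT) <-
      myFunc_in(COST_IN, PRICE_IN),
      subtract(COST_IN, PRICE_IN, PROFIT),
      COST_OUT=COST_IN,
      PRICE_OUT=PRICE_IN,
      PROFIT_OUT=PROFIT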
The first row, template, shows the initial template as it has been registered by the designer. @FUNCTION holds the name of the function to be used, subtract in our case, and @PARAM[ ] holds the inputs of the function, which in our case are the two attributes of the input schema. The problem we have to face is that all input, output and function schemata have a variable number of parameters. To abstract from the complexity of this problem, we define four macro definitions, one for each schema (INPUT_SCHEMA, OUTPUT_SCHEMA, FUNCTION_INPUT), along with a macro for the mapping of input to output attributes (DEFAULT_MAPPING).
Fig. 12. Instantiation procedure.
The second row, macro expansion, shows how the template looks after the macros have been incorporated in the template definition. The mechanics of the expansion are straightforward: observe how the attributes of the output schema are specified by the expression [i<arityOf(a_in)+1]{A_OUT_$i$,}OUTFIELD as an expansion of the macro OUTPUT_SCHEMA. In a similar fashion, the attributes of the input schema and the parameters of the function are also specified; note that the expression for the last attribute in the list is different (to avoid emitting an erroneous trailing comma). The mappings between the input and the output attributes are also shown in the last two lines of the template. In the third row, parameter instantiation, we can see how the parameter variables were materialized at instantiation. In the fourth row, loop production, we can see the intermediate results after the loop expansions are done. As can easily be seen, these expansions must be done before the @PARAM[ ] variables are replaced by their values. In the fifth row, variable instantiation, the parameter variables have been instantiated, creating a default mapping between the input, the output and the function attributes. Finally, in the last row, keyword renaming, the output LDL code is presented after the keywords have been renamed. Keyword instantiation