Etl design document


Published on


  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Etl design document

  1. 1. ARTICLE IN PRESS Information Systems 30 (2005) 492–525 A generic and customizable framework for the design of ETL scenarios Panos Vassiliadisa, Alkis Simitsisb, Panos Georgantasb, Manolis Terrovitisb, Spiros Skiadopoulosb a Department of Computer Science, University of Ioannina, Ioannina, Greece b Department of Electrical and Computer Engineering, National Technical University of Athens, Athens, GreeceAbstract Extraction–transformation–loading (ETL) tools are pieces of software responsible for the extraction of data fromseveral sources, their cleansing, customization and insertion into a data warehouse. In this paper, we delve into thelogical design of ETL scenarios and provide a generic and customizable framework in order to support the DWdesigner in his task. First, we present a metamodel particularly customized for the definition of ETL activities. Wefollow a workflow-like approach, where the output of a certain activity can either be stored persistently or passed to asubsequent activity. Also, we employ a declarative database programming language, LDL, to define the semantics ofeach activity. The metamodel is generic enough to capture any possible ETL activity. Nevertheless, in the pursuit ofhigher reusability and flexibility, we specialize the set of our generic metamodel constructs with a palette of frequentlyused ETL activities, which we call templates. Moreover, in order to achieve a uniform extensibility mechanism for thislibrary of built-ins, we have to deal with specific language issues. Therefore, we also discuss the mechanics of templateinstantiation to concrete activities. The design concepts that we introduce have been implemented in a tool, ARKTOS II,which is also presented.r 2004 Elsevier B.V. All rights reserved.Keywords: Data warehousing; ETL1. Introduction data extraction, transformation, integration, cleaning and transport. To deal with this work- Data warehouse operational processes normally flow, specialized tools are already available in thecompose a labor-intensive workflow, involving market [1–4], under the general title Extraction— Transformation– Loading (ETL) tools. To give a general idea of the functionality of these tools we E-mail addresses: (P. Vassiliadis), (A. Simitsis), mention their most prominent tasks, which include(P. Georgantas), (M. Terrovitis), (a) the identification of relevant information at (S. Skiadopoulos). source side, (b) the extraction of this information,0306-4379/$ - see front matter r 2004 Elsevier B.V. All rights reserved.doi:10.1016/
  2. 2. ARTICLE IN PRESS P. Vassiliadis et al. / Information Systems 30 (2005) 492–525 493 Logical Perspective Physical Perspective Execution Plan Resource Layer Execution Sequence Execution Schedule Recovery Plan Relationship with data Operational Layer Primary Data Flow Data Flowfor Logical Exceptions Administration Plan Monitoring & Logging Security & Access Rights Management Fig. 1. Different perspectives for an ETL workflow.(c) the customization and integration of the defining an execution plan for the scenario. Theinformation coming from multiple sources into a definition of an execution plan can be seen fromcommon format, (d) the cleaning of the resulting various perspectives. The execution sequence in-data set on the basis of database and business volves the specification of which activity runs first,rules, and (e) the propagation of the data to the second, and so on, which activities run in parallel,data warehouse and/or data marts. or when a semaphore is defined so that several If we treat an ETL scenario as a composite activities are synchronized at a rendezvous point.workflow, in a traditional way, its designer is ETL activities normally run in batch, so theobliged to define several of its parameters (Fig. 1). designer needs to specify an execution schedule,Here, we follow a multi-perspective approach that i.e., the time points or events that trigger theenables to separate these parameters and study execution of the scenario as a whole. Finally, duethem in a principled approach. We are mainly to system crashes, it is imperative that there existsinterested in the design and administration parts of a recovery plan, specifying the sequence of steps tothe lifecycle of the overall ETL process, and we be taken in the case of failure for a certain activitydepict them at the upper and lower part of Fig. 1, (e.g., retry to execute the activity, or undo anyrespectively. At the top of Fig. 1, we are mainly intermediate results produced so far). On the right-concerned with the static design artifacts for a hand side of Fig. 1, we can also see the physicalworkflow environment. We will follow a tradi- perspective, involving the registration of the actualtional approach and group the design artifacts into entities that exist in the real world. We will reuselogical and physical, with each category compris- the terminology of [5] for the physical its own perspective. We depict the logical The resource layer comprises the definition of rolesperspective on the left-hand side of Fig. 1, and the (human or software) that are responsible forphysical perspective on the right-hand side. At the executing the activities of the workflow. Thelogical perspective, we classify the design artifacts operational layer, at the same time, comprises thethat give an abstract description of the workflow software modules that implement the designenvironment. First, the designer is responsible for entities of the logical perspective in the real world.
  3. 3. ARTICLE IN PRESS494 P. Vassiliadis et al. / Information Systems 30 (2005) 492–525In other words, the activities defined at the logical activities, which we call templates. Moreover, inlayer (in an abstract way) are materialized and order to achieve a uniform extensibility mechan-executed through the specific software modules of ism for this library of built-ins, we have to dealthe physical perspective. with specific language issues: thus, we also discuss At the lower part of Fig. 1, we are dealing with the mechanics of template instantiation to concretethe tasks that concern the administration of the activities. The design concepts that we introduceworkflow environment and their dynamic beha- have been implemented in a tool, ARKTOS II, whichvior at runtime. First, an administration plan is also presented.should be specified, involving the notification of Our contributions can be listed as follows:the administrator either on-line (monitoring) oroff-line (logging) for the status of an executed First, we define a formal metamodel as anactivity, as well as the security and authentication abstraction of ETL processes at the logical for the ETL environment. The data stores, activities and their constituent We find that research has not dealt with the parts are formally defined. An activity is defineddefinition of data-centric workflows to the entirety as an entity with possibly more than one inputof its extent. In the ETL case, for example, due to schemata, an output schema and a parameterthe data centric nature of the process, the designer schema, so that the activity is populated eachmust deal with the relationship of the involved time with its proper parameter values. The flowactivities with the underlying data. This involves the of data from producers towards their consumersdefinition of a primary data flow that describes the is achieved through the usage of providerroute of data from the sources towards their final relationships that map the attributes of thedestination in the data warehouse, as they pass former to the respective attributes of the latter.through the activities of the scenario. Also, due to A serializable combination of ETL activities,possible quality problems of the processed data, provider relationships and data stores constitu-the designer is obliged to define a data flow for tes an ETL scenario.logical exceptions, i.e., a flow for the problematic Second, we provide a reusability framework thatdata, i.e., the rows that violate integrity or business complements the genericity of the metamodel.rules. It is the combination of the execution Practically, this is achieved from a set of ‘‘built-sequence and the data flow that generates the in’’ specializations of the entities of the meta-semantics of the ETL workflow: the data flow model layer, specifically tailored for the mostdefines what each activity does and the execution frequent elements of ETL scenarios. This paletteplan defines in which order and combination. of template activities will be referred to as In this paper, we work in the internals of the template layer and it is characterized by itsdata flow of ETL scenarios. First, we present a extensibility; in fact, due to language considera-metamodel particularly customized for the defini- tions, we provide the details of the mechanismtion of ETL activities. We follow a workflow-like that instantiates templates to specific activities.approach, where the output of a certain activity Finally, we discuss implementation issues and wecan either be stored persistently or passed to a present a graphical tool, ARKTOS II that facil-subsequent activity. Moreover, we employ a itates the design of ETL scenarios, based on ourdeclarative database programming language, model.LDL, to define the semantics of each activity.The metamodel is generic enough to capture any This paper is organized as follows. In Section 2,possible ETL activity; nevertheless, reusability and we present a generic model of ETL activities.ease-of-use dictate that we can do better in aiding Section 3 describes the mechanism for specifyingthe data warehouse designer in his task. In this and materializing template definitions of fre-pursuit of higher reusability and flexibility, we quently used ETL activities. Section 4 presentsspecialize the set of our generic metamodel ARKTOS II, a prototype graphical tool. In Section 5,constructs with a palette of frequently used ETL we survey related work. In Section 6, we make a
  4. 4. ARTICLE IN PRESS P. Vassiliadis et al. / Information Systems 30 (2005) 492–525 495general discussion on the completeness and general staging area (DSA)1 DS. The scenario involves theapplicability of our approach. Section 7 offers propagation of data from the table PARTSUPP ofconclusions and presents topics for future re- source S1 to the data warehouse DW. Tablesearch. Short versions of parts of this paper have DW.PARTSUPP (PKEY, SOURCE, DATE, QTY,been presented in [6,7]. COST) stores information for the available quan- tity (QTY) and cost (COST) of parts (PKEY) per source (SOURCE). The data source S1.2. Generic model of ETL activities PARTSUPP (PKEY, DATE, QTY, COST) records the supplies from a specific geographical region, The purpose of this section is to present a formal e.g., Europe. All the attributes, except for the dateslogical model for the activities of an ETL are instances of the Integer type. The scenario isenvironment. This model abstracts from the graphically depicted in Fig. 3 and involves thetechnicalities of monitoring, scheduling and log- following transformations.ging while it concentrates on the flow of data fromthe sources towards the data warehouse through 1. First, we transfer via FTP_PS1 the snapshotthe composition of activities and data stores. The from the source S1.PARTSUPP to the filefull layout of an ETL scenario, involving activities, DS.PS1_NEW of the DSA.2recordsets and functions can be modeled by a 2. In the DSA, we maintain locally a copy of thegraph, which we call the architecture graph. We snapshot of the source as it was at the previousemploy a uniform, graph-modeling framework for loading (we assume here the case of theboth the modeling of the internal structure of incremental maintenance of the DW, instead ofactivities and the ETL scenario at large, which the case of the initial loading of the DW). Theenables the treatment of the ETL environment recordset DS.PS1_NEW (PKEY, DATE, QTY,from different viewpoints. First, the architecture COST) stands for the last transferred snapshotgraph comprises all the activities and data stores of of S1.PARTSUPP. By detecting the differencea scenario, along with their components. Second, of this snapshot with the respective version ofthe architecture graph captures the data flow the previous loading, DS.PS1_OLD (PKEY,within the ETL environment. Finally, the informa- DATE, QTY, COST), we can derive the newlytion on the typing of the involved entities and the inserted rows in S1.PARTSUPP. Note that theregulation of the execution of a scenario, through difference activity that we employ, namelyspecific parameters are also covered. Diff_PS1, checks for differences only on the primary key of the recordsets; thus, we ignore2.1. Graphical notation and motivating example here any possible deletions or updates for the attributes COST, QTY of existing rows. Any not Being a graph, the architecture graph of an ETL newly inserted row is rejected and so, it isscenario comprises nodes and edges. The involved propagated to Diff_PS1_REJ that stores alldata types, function types, constants, attributes, the rejected rows. The schema of Diff_PS1_activities, recordsets, parameters and functions REJ is identical to the input schema of theconstitute the nodes of the graph. The different activity Diff_PS1.kinds of relationships among these entities aremodeled as the edges of the graph. In Fig. 2, we 1 In data warehousing terminology a DSA is an intermediategive the graphical notation for all the modeling area of the data warehouse, specifically destined to enable theconstructs that will be presented in the sequel. transformation, cleaning and integration of source data, before Motivating example: To motivate our discus- being loaded to the warehouse. 2sion, we will present an example involving the The technical points, like FTP, are mostly employed to show what kind of problems someone has to deal with in a practicalpropagation of data from a certain source S1, situation, rather than to relate this kind of physical operationstowards a data warehouse DW through intermedi- to a logical model. In terms of logical modelling this is a simpleate recordsets. These recordsets belong to a data passing of data from one site to another.
  5. 5. ARTICLE IN PRESS496 P. Vassiliadis et al. / Information Systems 30 (2005) 492–525 Data Types Black ellipsoid RecordSets Cylinders Function Black rectangles Functions Gray rectangles Types Constants Black circles 1 Parameters White rectangles Attributes Unshaded ellipsoid Activities Triangles Provider Bold solid arrows Part-Of Simple lines with Relationships (from provider to Relationships diamond edges* consumer) Bold dotted Dotted arrows Derived Instance-Of arrows (from (from instance Provider Relationships provider to towards the type) Relationships consumer) * We annotate the part-of relationship among a Regulator Dotted lines function and its return type with a directed edge, to Relationships distinguish it from therest of the parameters. Fig. 2. Graphical notation for the architecture graph. DS.PS1.PKEY DS.PS1_NEW.PKEY LOOKUP.PKEY = COST SOURCE = 1 LOOKUP.SOURCE DS.PS1_OLD.PKEY LOOKUP.SKEY S1.PARTSUPP FTP_PS1 DS.PS1_NEW Diff_PS1 NotNu111 Add_Attr1 DS.PS1 SK1 DW.PARTSUPP Source DS.PS1_OLD Data Warehouse Diff_PS1 Not Nul 111 LOOKUP _REJ _ REJ DSA Fig. 3. Bird’s-eye view of the motivating example.3. The rows that pass the activity Diff_PS1 are Diff_PS1_REJ recordset for further examina- checked for null values of the attribute COST tion by the data warehouse administrator. through the activity NotNull1. Rows having a 4. Although we consider the data flow for only NULL value for their COST are kept in the one source, namely S1, the data warehouse can
  6. 6. ARTICLE IN PRESS P. Vassiliadis et al. / Information Systems 30 (2005) 492–525 497 clearly have more sources for part supplies. In Finally, before proceeding, we would like to order to keep track of the source of each row stress that we do not anticipate a manual entering into the DW, we need to add a ‘flag’ construction of the graph by the designer; rather, attribute, namely SOURCE, indicating the re- we employ this section to clarify how the graph spective source. This is achieved through the will look once constructed. To assist a more activity Add_Attr1. We store the rows that automatic construction of ETL scenarios, we have stem from this process in the recordset DS.PS1 implemented the ARKTOS II tool that supports the (PKEY, SOURCE, DATE, QTY, COST). designing process through a friendly GUI. We5. Next, we assign a surrogate key on PKEY. In the present ARKTOS II in Section 4. data warehouse context, it is common tactics to replace the keys of the production systems with 2.2. Preliminaries a uniform key, which we call a surrogate key [8]. The basic reasons for this replacement are In this subsection, we will introduce the formal performance and semantic homogeneity. Tex- modeling of data types, data stores and functions, tual attributes are not the best candidates for before proceeding to the modeling of ETL indexed keys and thus, they need to be replaced activities. by integer keys. At the same time, different production systems might use different keys for Elementary entities: We assume the existence of the same object, or the same key for different a countable set of data types. Each data type T is objects, resulting in the need for a global characterized by a name and a domain, i.e., a replacement of these values in the data ware- countable set of values, called dom (T). The house. This replacement is performed through a values of the domains are also referred to as lookup table of the form L (PRODKEY, constants. SOURCE, SKEY). The SOURCE column is due We also assume the existence of a countable set to the fact that there can be synonyms in the of attributes, which constitute the most elementary different sources, which are mapped to different granules of the infrastructure of the information objects in the data warehouse. In our case, the system. Attributes are characterized by their name activity that performs the surrogate key assign- and data type. The domain of an attribute is a ment for the attribute PKEY is SK1. It uses the subset of the domain of its data type. Attributes lookup table LOOKUP (PKEY, SOURCE, and constants are uniformly referred to as terms. SKEY). Finally, we populate the data ware- house with the output of the previous activity. A schema is a finite list of attributes. Each entity that is characterized by one or more schemata will The role of rejected rows depends on the be called structured entity. Moreover, we assumepeculiarities of each ETL scenario. If the designer the existence of a special family of schemata, allneeds to administrate these rows further, then he/ under the general name of NULL schema,she should use intermediate storage recordsets determined to act as placeholders for data whichwith the burden of an extra I/O cost. If the rejected are not to be stored permanently in some datarows should not have a special treatment, then the store. We refer to a family instead of a singlebest solution is to be ignored; thus, in this case we NULL schema, due to a subtle technicalityavoid overloading the scenario with any extra involving the number of attributes of such astorage recordset. In our case, we annotate only schema (this will become clear in the sequel).two of the presented activities with a destina-tion for rejected rows. Out of these, while Recordsets: We define a record as the instantia-NotNull1_REJ absolutely makes sense as a tion of a schema to a list of values belonging toplaceholder for problematic rows having non- the domains of the respective schema attributes.acceptable NULL values, Diff_PS1_REJ is pre- We can treat any data structure as a re-sented for demonstration reasons only. cordset provided that there are ways to logically
  7. 7. ARTICLE IN PRESS498 P. Vassiliadis et al. / Information Systems 30 (2005) 492–525restructure it into a flat, typed record schema. LDL, we avoid dealing with the peculiarities of aFormally, a recordset is characterized by its name, particular programming language. Once again, weits (logical) schema and its (physical) extension want to stress that the presented LDL description(i.e., a finite set of records under the recordset is intended to capture the semantics of eachschema). If we consider a schema S ¼ activity, instead of the way these activities are[A1, y, Ak], for a certain recordset, its extension actually a mapping S ¼ [A1, y, Ak]-dom(A1)  y An elementary activity is formally described by  dom(Ak). Thus, the extension of the recordset the following elements:is a finite subset of dom(A1)  y  dom(Ak) anda record is the instance of a mapping dom(A1) Name: A unique identifier for the activity.  y  dom(Ak)-[x1,y,xk], xiAdom(Ai). Input schemata: A finite set of one or more input schemata that receives data from the dataIn the rest of this paper we will mainly deal with providers of the activity.the two most popular types of recordsets, namelyrelational tables and record files. A database is a Output schema: A schema that describes the placeholder for the rows that pass the checkfinite set of relational tables. performed by the elementary activity. Functions. We assume the existence of a Rejections schema: A schema that describes the placeholder for the rows that do not pass thecountable set of built-in system function types. A check performed by the activity, or their valuesfunction type comprises a name, a finite list of are not appropriate for the performed transfor-parameter data types, and a single return data type. mation.A function is an instance of a function type.Consequently, it is characterized by a name, a list Parameter list: A set of pairs which act as regulators for the functionality of the activityof input parameters and a parameter for its return (the target attribute of a foreign key check, forvalue. The data types of the parameters of the example). The first component of the pair is agenerating function type also define (a) the data name and the second is a schema, an attribute, atypes of the parameters of the function and (b) the function or a candidates for the function parameters (i.e.,attributes or constants of a suitable data type). Output operational semantics: An LDL state- ment describing the content passed to the output of the operation, with respect to its2.3. Activities input. This LDL statement defines (a) the operation performed on the rows that pass Activities are the backbone of the structure of through the activity and (b) an implicit mappingany information system. We adopt the WfMC between the attributes of the input schema(ta)terminology [9] for processes/programs and we will and the respective attributes of the outputcall them activities in the sequel. An activity is an schema.amount of ‘‘work which is processed by acombination of resource and computer applica- Rejection operational semantics: An LDL state- ment describing the rejected records, in a sensetions’’ [9]. In our framework, activities are logical similar to the output operational semantics.abstractions representing parts or full modules of This statement is by default considered to be thecode. complement of the output operational seman- The execution of an activity is performed from a tics, except if explicitly defined differently.particular program. Normally, ETL activities willbe either performed in a black-box manner by a There are two issues that we would like todedicated tool, or they will be expressed in some elaborate on, here:language (e.g., PL/SQL, Perl, C). Still, we want todeal with the general case of ETL activities. We NULL schemata: Whenever we do not specifyemploy an abstraction of the source code of an a data consumer for the output or rejec-activity, in the form of an LDL statement. Using tion schemata, the respective NULL schema
  8. 8. ARTICLE IN PRESS P. Vassiliadis et al. / Information Systems 30 (2005) 492–525 499 (involving the correct number of attributes) is notation of entities (nodes) and relationships implied. This practically means that the data (edges) is presented in Fig. 2. targeted for this schema will neither be stored to some persistent data store, nor will they be Part-of relationships. These relationships in- propagated to another activity, but they will volve attributes and parameters and relate them simply be ignored. to the respective activity, recordset or function Language issues: Initially, we used to specify the to which they belong. semantics of activities with SQL statements. Instance-of relationships. These relationships are Still, although clear and easy to write and defined among a data/function type and its understand, SQL is rather hard to use if one is instances. to perform rewriting and composition of state- Provider relationships. These are relationships ments. Thus, we have supplemented SQL with that involve attributes with a provider–consu- LDL [10], a logic programming, declarative mer relationship. language as the basis of our scenario definition. Regulator relationships. These relationships are LDL is a Datalog variant based on a Horn- defined among the parameters of activities and clause logic that supports recursion, complex the terms that populate these activities. objects and negation. In the context of its Derived provider relationships. A special case of implementation in an actual deductive database provider relationships that occurs whenever management system, LDL++ [11], the lan- output attributes are computed through the guage has been extended to support external composition of input attributes and parameters. functions, choice, aggregation (and even, user- Derived provider relationships can be deduced defined aggregation), updates and several other from a simple rule and do not originally features. constitute a part of the graph.2.4. Relationships in the architecture graph In the rest of this subsection, we will detail the notions pertaining to the relationships of the In this subsection, we will elaborate on the Architecture Graph; the knowledgeable reader isdifferent kinds of relationships that the entities of referred to Section 2.5 where we discuss the issuean ETL scenario have. Whereas these entities are of scenarios. We will base our discussions on amodeled as the nodes of the architecture graph, part of the scenario of the motivating examplerelationships are modeled as its edges. Due to their (presented in Section 2.1), including activity SK1.diversity, before proceeding, we list these types ofrelationships along with the related terminology Data types and instance-of relationships: Tothat we will use in this paper. The graphical capture typing information on attributes and OUT IN OUT IN DW.PARTS DS.PS1 SK1 UPP PKEY PKEY PKEY PKEY Integer QTY QTY QTY QTY COST COST COST COST Date DATE DATE DATE DATE SOURCE SOURCE SOURCE SOURCE SKEY Fig. 4. Instance-of relationships of the architecture graph.
  9. 9. ARTICLE IN PRESS500 P. Vassiliadis et al. / Information Systems 30 (2005) 492–525functions, the architecture graph comprises data tion schema of the activity, respectively). Natu-and function types. Instantiation relationships are rally, if the activity involves more than one inputdepicted as dotted arrows that stem from the schemata, the relationship is tagged with an INiinstances and head toward the data/function types. tag for the ith input schema. We also incorporateIn Fig. 4, we observe the attributes of the two the functions along with their respective para-activities of our example and their correspondence meters and the part-of relationships among theto two data types, namely integer and date. former and the latter. We annotate the part-ofFor reasons of presentation, we merge several relationship with the return type with a directedinstantiation edges so that the figure does not edge, to distinguish it from the rest of thebecome too crowded. parameters. Fig. 5 depicts a part of the motivating example. Attributes and part-of relationships: The first In terms of part-of relationships, we present thething to incorporate in the architecture graph is decomposition of (a) the recordsets DS.PS1,the structured entities (activities and recordsets) LOOKUP, DW.PARTSUPP and (b) the activity SK1along with all the attributes of their schemata. We and the attributes of its input and outputchoose to avoid overloading the notation by schemata. Note the tagging of the schemata ofincorporating the schemata per se; instead we the involved activity. We do not consider theapply a direct part-of relationship between an rejection schemata in order to avoid crowding theactivity node and the respective attributes. We picture. Also note, how the parameters of theannotate each such relationship with the name of activity are also incorporated in the architecturethe schema (by default, we assume a IN, OUT, graph. Activity SK1 has five parameters: (a) PKEY,PAR, REJ tag to denote whether the attribute which stands for the production key to bebelongs to the input, output, parameter or rejec- replaced, (b) SOURCE, which stands for an integer OUT IN OUT IN DW.PARTS DS.PS1 SK1 UPP PAR PKEY PKEY PKEY PKEY QTY QTY QTY QTY COST COST COST COST DATE DATE DATE DATE SOURCE SOURCE SOURCE SOURCE SKEY PKEY SOURCE PKEY LPKEY OUT LOOKUP SOURCE LSOURCE SKEY LSKEY Fig. 5. Part-of regulator and provider relationships of the architecture graph.
  10. 10. ARTICLE IN PRESS P. Vassiliadis et al. / Information Systems 30 (2005) 492–525 501value that characterizes which source’s data are relationship among the attributes of the involvedprocessed, (c) LPKEY, which stands for the schemata.attribute of the lookup table which contains the Formally, a provider relationship is defined byproduction keys, (d) LSOURCE, which stands for the following elements:the attribute of the lookup table which containsthe source value (corresponding to the aforemen- Name: A unique identifier for the providertioned SOURCE parameter), (e) LSKEY, which relationship.stands for the attribute of the lookup table which Mapping: An ordered pair. The first part of thecontains the surrogate keys. pair is a term (i.e., an attribute or constant) acting as a provider and the second part is an Parameters and regulator relationships: Once the attribute acting as the consumer.part-of and instantiation relationships have beenestablished, it is time to establish the regulator The mapping need not necessarily be 1:1 fromrelationships of the scenario. In this case, we link provider to consumer attributes, since an inputthe parameters of the activities to the terms attribute can be mapped to more than one(attributes or constants) that populate them. We consumer attributes. Still, the opposite does notdepict regulator relationships with simple dotted hold. Note that a consumer attribute can also beedges. populated by a constant, in certain cases. In the example of Fig. 5 we can also observe In order to achieve the flow of data from thehow the parameters of activity SK1 are populated providers of an activity towards its consumers, wethrough regulator relationships. The parameters need the following three groups of providerin and out are mapped to the respective terms relationships:through regulator relationships. All the para-meters of SK1, namely PKEY, SOURCE, LPKEY, 1. A mapping between the input schemata of theLSOURCE and LSKEY, are mapped to the respec- activity and the output schema of their datative attributes of either the activity’s input schema providers. In other words, for each attribute ofor the employed lookup table LOOKUP. The an input schema of an activity, there must existparameter LSKEY deserves particular attention. an attribute of the data provider, or a constant,This parameter is (a) populated from the attribute which is mapped to the former attribute.SKEY of the lookup table and (b) used to populate 2. A mapping between the attributes of the activitythe attribute SKEY of the output schema of the input schemata and the activity output (oractivity. Thus, two regulator relationships are rejection, respectively) schema.related with parameter LSKEY, one for each of 3. A mapping between the output or rejectionthe aforementioned attributes. The existence of a schema of the activity and the (input) schema ofregulator relationship among a parameter and an its data consumer.output attribute of an activity normally denotesthat some external data provider is employed in The mappings of the second type are internal toorder to derive a new attribute, through the the activity. Basically, they can be derived from therespective parameter. LDL statement for each of the output/rejection schemata. As far as the first and the third types of Provider relationships: The flow of data from the provider relationships are concerned, the map-data sources towards the data warehouse is pings must be provided during the construction ofperformed through the composition of activities the ETL scenario. This means that they are eitherin a larger scenario. In this context, the input for (a) by default assumed by the order of thean activity can be either a persistent data store, or attributes of the involved schemata or (b) hard-another activity. Usually, this applies for the coded by the user. Provider relationships areoutput of an activity, too. We capture the passing depicted with bold solid arrows that stem fromof data from providers to consumers by a provider the provider and end in the consumer attribute.
  11. 11. ARTICLE IN PRESS502 P. Vassiliadis et al. / Information Systems 30 (2005) 492–525 Observe Fig. 5. The flow starts from table relationships, previously discussed: the first ruleDS.PS1 of the data staging area. Each of the explains how the data from the DS.PS1 recordsetattributes of this table is mapped to an attribute of are fed into the input schema of the activity, thethe input schema of activity SK1. The attributes of second rule explains the semantics of activity (i.e.,the input schema of the latter are subsequently how the surrogate key is generated) and, finally,mapped to the attributes of the output schema of the third rule shows how the DW.PARTSUPPthe activity. The flow continues to DW.PARTSUPP. recordset is populated from the output schema ofAnother interesting thing is that during the data the activity SK1.flow, new attributes are generated, resulting on newstreams of data, whereas the flow seems to stop for Derived provider relationships: As we haveother attributes. Observe the rightmost part of already mentioned, there are certain outputFig. 5 where the values of attribute PKEY are not attributes that are computed through the composi-further propagated (remember that the reason for tion of input attributes and parameters. A derivedthe application of a surrogate key transformation is provider relationship is another form of providerto replace the production keys of the source data to relationship that captures the flow from the inputa homogeneous surrogate for the records of the to the respective output warehouse, which is independent of the source Formally, assume that (a) source is a term inthey have been collected from). Instead of the the architecture graph, (b) target is an attributevalues of the production key, the values from the of the output schema of an activity A and (c) x,yattribute SKEY will be used to denote the unique are parameters in the parameter list of A (notidentifier for a part in the rest of the flow. necessary different). Then, a derived provider In Fig. 6, we depict the LDL definition of this relationship pr(source, target) exists iff thepart of the motivating example. The three rules following regulator relationships (i.e., edges) exist:correspond to the three categories of provider rr1(source, x) and rr2(y, target). addSkey_in1(A_IN1_PKEY,A_IN1_DATE,A_IN1_QTY,A_IN1_COST,A_IN1_SOURCE) ds_ps1(A_OUT_PKEY,A_OUT_DATE,A_OUT_QTY,A_OUT_COST,A_OUT_SOURCE), A_OUT_PKEY=A_IN1_PKEY, A_OUT_DATE=A_IN1_DATE, A_OUT_QTY=A_IN1_QTY, A_OUT_COST=A_IN1_COST, A_OUT_SOURCE=A_IN1_SOURCE. addSkey_out(A_OUT_PKEY,A_OUT_DATE,A_OUT_QTY,A_OUT_COST,A_OUT_SOURCE,A_OUT_SKEY) addSkey_in1(A_IN1_PKEY,A_IN1_DATE,A_IN1_QTY,A_IN1_COST,A_IN1_SOURCE), lookup(A_IN1_SOURCE,A_IN1_PKEY,A_OUT_SKEY), A_OUT_PKEY=A_IN1_PKEY, A_OUT_DATE=A_IN1_DATE, A_OUT_QTY=A_IN1_QTY, A_OUT_COST=A_IN1_COST, A_OUT_SOURCE=A_IN1_SOURCE. dw_partsupp(PKEY,DATE,QTY,COST,SOURCE) addSkey_out(A_OUT_PKEY,A_OUT_DATE,A_OUT_QTY,A_OUT_COST,A_OUT_SOURCE,A_OUT_SKEY), DATE=A_IN1_DATE, QTY=A_IN1_QTY, COST=A_IN1_COST SOURCE=A_IN1_SOURCE, PKEY=A_IN1_SKEY. NOTE: For reasonsof readability we do not replace the Ain attribute names with the activity name, i.e.,A_OUT_PKEYshould be diffPS1_OUT_PKEY. Fig. 6. LDL specification of the motivating example.
  12. 12. ARTICLE IN PRESS P. Vassiliadis et al. / Information Systems 30 (2005) 492–525 503 IN OUT SK1 PAR PKEY PKEY SOURCE SOURCE IN OUT SK1 SKEY PAR PKEY PKEY SOURCE SOURCE SKEY PKEY PKEY LPKEY OUT LOOKUP SOURCE OUT LOOKUP SOURCE LSOURCE SKEY SKEY LSKEYFig. 7. Derived provider relationships of the architecture graph: the original situation on the left and the derived provider relationshipson the right. Intuitively, the case of derived relationships ships that do not involve constants (remember thatmodels the situation where the activity computes we have defined source as a term); (b) relation-a new attribute in its output. In this case, the ships involving only attributes of the same/produced output depends on all the attributes that different activity (as a measure of internal com-populate the parameters of the activity, resulting plexity or external dependencies); (c) relationshipsin the definition of the corresponding derived relating attributes that populate only the samerelationship. parameter (e.g., only the attributes LOOKUP.SKEY Observe Fig. 7, where we depict a small part of and SK1.OUT.SKEY).our running example. The left side of the figuredepicts the situation where only provider relation- 2.5. Scenariosships exist. The legend in the right side of Fig. 7depicts how we compute the derived provider A scenario is an enumeration of activities alongrelationships between the parameters of the with their source/target recordsets and the respec-activity and the computed output attribute SKEY. tive provider relationships for each activity. AnThe meaning of these five relationships is that ETL scenario consists of the following elements:SK1.OUT.SKEY is not computed only fromattribute LOOKUP.SKEY, but from the combina- Name: A unique identifier for the scenario.tion of all the attributes that populate the Activities: A finite list of activities. Note that byparameters. employing a list (instead of e.g., a set) of One can also assume different variations of activities, we impose a total ordering on thederived provider relationships such as (a) relation- execution of the scenario.
  13. 13. ARTICLE IN PRESS504 P. Vassiliadis et al. / Information Systems 30 (2005) 492–525 Entity Model-specific Scenario-specific Data Types DI D Built -in Function Types FI F Constants CI C Attributes ΩI Ω Functions ΦI Φ Schemata SI S User-provided RecordSets RSI RS Activities AI A Provider Relationships PrI Pr Part-Of Relationships PoI Po Instance-Of Relationships IoI Io Regulator Relationships RrI Rr Derived Provider Relationships DrI Dr Fig. 8. Formal definition of domains and notation. Recordsets: A finite set of recordsets. In the sequel, we treat the terms architecture Targets: A special-purpose subset of the record- graph and scenario interchangeably. The reason- sets of the scenario, which includes the final ing for the term ‘architecture graph’ goes all the destinations of the overall process (i.e., the data way down to the fundamentals of conceptual warehouse tables that must be populated by the modeling. As mentioned in [12], conceptual activities of the scenario). models are the means by which designers conceive, Provider relationships: A finite list of provider architect, design, and build software systems. relationships among activities and recordsets of These conceptual models are used in the same the scenario. way that blueprints are used in other engineering disciplines during the early stages of the lifecycle of In our modeling, a scenario is a set of activities, artificial systems, which involves the creation of their architecture. The term ‘architecture graph’deployed along a graph in an execution sequence expresses the fact that the graph that we employthat can be linearly serialized. For the moment, we for the modeling of the data flow of the ETLdo not consider the different alternatives for the scenario is practically acting as a blueprint of theordering of the execution; we simply require that a architecture of this software order for this execution is present (i.e., each Moreover, we assume the following integrityactivity has a discrete execution priority). constraints for a scenario: In terms of formal modeling of the architecturegraph, we assume the infinitely countable, mu- Static constraints:tually disjoint sets of names (i.e., the values ofwhich respect the unique name assumption) of All the weak entities of a scenario (i.e.,column model-specific in Fig. 8. As far as a specific attributes or parameters) should be definedscenario is concerned, we assume their respective within a part-of relationship (i.e., they shouldfinite subsets, depicted in column scenario-specific have a container object).in Fig. 8. Data types, function types and constants All the mappings in provider relationshipsare considered built-in’s of the system, whereas the should be defined among terms (i.e., attributesrest of the entities are provided by the user (user or constants) of the same data type.provided). Formally, the architecture graph of an ETLscenario is a graph G(V,E) defined as follows: Data flow constraints: V ¼ D[F[C[X[/[S[RS[A All the attributes of the input schema(ta) of an E ¼ Pr[Po[Io[Rr[Dr. activity should have a provider.
  14. 14. ARTICLE IN PRESS P. Vassiliadis et al. / Information Systems 30 (2005) 492–525 505 Resulting from the previous requirement, if 3.1. General framework some attribute is a parameter in an activity A, the container of the attribute (i.e., recordset or Our philosophy during the construction of our activity) should precede A in the scenario. metamodel was based on two pillars: (a) genericity, All the attributes of the schemata of the target i.e., the derivation of a simple model, powerful to recordsets should have a data provider. capture ideally all the cases of ETL activities and (b) extensibility, i.e., the possibility of extending Summarizing, in this section, we have presented the built-in functionality of the system with new,a generic model for the modeling of the data flow user-specific templates.for ETL workflows. In the next section, we will The genericity doctrine was pursued through theproceed to detail how this generic model can be definition of a rather simple activity metamodel, asaccompanied by a customization mechanism, in described in Section 2. Still, providing a singleorder to provide higher flexibility to the designer metaclass for all the possible activities of an ETLof the workflow. environment is not really enough for the designer of the overall process. A richer ‘‘language’’ should be available, in order to describe the structure of3. Templates for ETL activities the process and facilitate its construction. To this end, we provide a palette of template activities, In this section, we present the mechanism for which are specializations of the generic metamodelexploiting template definitions of frequently used class.ETL activities. The general framework for the Observe Fig. 9 for a further explanation of ourexploitation of these templates is accompanied framework. The lower layer of Fig. 9, namelywith the presentation of the language-related schema layer, involves a specific ETL scenario.issues for template management and appropriate All the entities of the schema layer are instances ofexamples. the classes Data Type, Function Type, Datatypes Functions Elementary Activity RecotdSet Relationships Metamodel Layer IsA Domain Mismatch Source Table Provider Re NotNull SK Assignment Fact Table Template Layer InstanceOf S1.PARTSUPF NN DM1 SK1 DW.PARTSUPP Schema Layer Fig. 9. The metamodel for the logical entities of the ETL environment.
  15. 15. ARTICLE IN PRESS506 P. Vassiliadis et al. / Information Systems 30 (2005) 492–525Elementary Activity, RecordSet and In the example of Fig. 9 the concept DW.Relationship. Thus, as one can see on the PARTSUPP must be populated from a certainupper part of Fig. 9, we introduce a meta-class source S1.PARTSUPP. Several operations mustlayer, namely metamodel layer involving the intervene during the propagation. For instance inaforementioned classes. The linkage between the Fig. 9, we check for null values and domainmetamodel and the schema layers is achieved violations, and we assign a surrogate key. As onethrough instantiation (InstanceOf) relation- can observe, the recordsets that take part in thisships. The metamodel layer implements the afore- scenario are instances of class RecordSet (be-mentioned genericity desideratum: the classes longing to the metamodel layer) and specifically ofwhich are involved in the metamodel layer are its subclasses Source Table and Fact Table.generic enough to model any ETL scenario, Instances and encompassing classes are relatedthrough the appropriate instantiation. through links of type InstanceOf. The same Still, we can do better than the simple provision mechanism applies to all the activities ofof a metalayer and an instance layer. In order to the scenario, which are (a) instances of classmake our metamodel truly useful for practi- Elementary Activity and (b) instances ofcal cases of ETL activities, we enrich it with a set one of its subclasses, depicted in Fig. 9. Relation-of ETL-specific constructs, which constitute a ships do not escape this rule either. For instance,subset of the larger metamodel layer, namely observe how the provider links from the conceptthe template layer. The constructs in the template S1.PS toward the concept DW.PARTSUPP arelayer are also meta-classes, but they are related to class Provider Relationshipquite customized for the regular cases of ETL through the appropriate InstanceOf links.activities. Thus, the classes of the template layer As far as the class Recordset is concerned, inare specializations (i.e., subclasses) of the generic the template layer we can specialize it to severalclasses of the metamodel layer (depicted as subclasses, based on orthogonal characteristics,IsA relationships in Fig. 9). Through this custo- such as whether it is a file or RDBMS table, ormization mechanism, the designer can pick the whether it is a source or target data store (as ininstances of the schema layer from a much Fig. 9). In the case of the class Relationship,richer palette of constructs; in this setting, the there is a clear specialization in terms of the fiveentities of the schema layer are instantiations, not classes of relationships which have alreadyonly of the respective classes of the metamodel been mentioned in Section 2 (i.e., Provider,layer, but also of their subclasses in the template Part-Of, Instance-Of, Regulator andlayer. Derived Provider). Filters Unary operations Binary operations - Selection (σ) - Push - Union (U) - Not null (NN) - Aggregation (γ) - Join ( ) ∆ ∆ - Primary key - Projection (Π) - Diff (∆) violation (PK) - Function application (f) - Update Detection (∆UPD) - Foreign key - Surrogate key assignment (SK) violation (FK) - Tuple normalization (N) - Unique value (UN) - Tuple denormalization(DN) - Domain mismatch (DM) File operations Transfer operations - EBCDIC to ASCII conversion - Ftp (FTP) (EB2AS) - Compress / Decompress (Z/dZ) - Sort file (Sort) - Encrypt / Decrypt (Cr/dCr) Fig. 10. Template activities, along with their graphical notation symbols, grouped by category.
  16. 16. ARTICLE IN PRESS P. Vassiliadis et al. / Information Systems 30 (2005) 492–525 507 Following the same framework, class Elemen- frequent elements of ETL scenarios. Moreover,tary Activity is further specialized to an apart from this ‘‘built-in’’, ETL-specific extensionextensible set of reoccurring patterns of ETL of the generic metamodel, if the designer decidesactivities, depicted in Fig. 10. As one can see on that several ‘patterns’, not included in the palettethe top side of Fig. 9, we group the template of the template layer, occur repeatedly in his dataactivities in five major logical groups. We do not warehousing projects, he can easily fit them intodepict the grouping of activities in subclasses in the customizable template layer through a specia-Fig. 9, in order to avoid overloading the figure; lization mechanism.instead, we depict the specialization of classElementary Activity to three of its subclasses 3.2. Formal definition and usage of templatewhose instances appear in the employed scenario activitiesof the schema layer. We now proceed to presenteach of the aforementioned groups in more detail. Once the template layer has been introduced, The first group, named filters, provides checks the obvious issue that is raised is its linkage withfor the satisfaction (or not) of a certain condition. the employed declarative language of our frame-The semantics of these filters are the obvious work. In general, the broader issue is the usage of(starting from a generic selection condition the template mechanism from the user; to this end,and proceeding to the check for null values, we will explain the substitution mechanism forprimary or foreign key violation, etc.). templates in this subsection and refer the interestedThe second group of template activities is called reader to [13] for a presentation of the specificunary operations and except for the most generic templates that we have constructed.push activity (which simply propagates data from A template activity is formally defined by thethe provider to the consumer), consists of the following elements:classical aggregation and function appli-cation operations along with three data ware- Name: A unique identifier for the templatehouse specific transformations (surrogate key activity.assignment, normalization and denorma- Parameter list: A set of names which act aslization). The third group consists of classical regulators in the expression of the semantics ofbinary operations, such as union, join and the template activity. For example, the para-difference of recordsets/activities as well as meters are used to assign values to constants,with a special case of difference involving the create dynamic mapping at instantiation time,detection of updates. Except for the afore- etc.mentioned template activities, which mainly refer Expression: A declarative statement describingto logical transformations, we can also consider the operation performed by the instances of thethe case of physical operators that refer to the template activity. As with elementary activities,application of physical transformations to whole our model supports LDL as the formalism forfiles/tables. In the ETL context, we are mainly the expression of this statement.interested in operations like transfer operations Mapping: A set of bindings, mapping input to(ftp, compress/decompress, encrypt/ output attributes, possibly through intermediatedecrypt) and file operations (EBCDIC to AS- placeholders. In general, mappings at theCII, sort file). template level try to capture a default way of Summarizing, the metamodel layer is a set of propagating incoming values from the inputgeneric entities, able to represent any ETL towards the output schema. These defaultscenario. At the same time, the genericity of the bindings are easily refined and possibly rear-metamodel layer is complemented with the exten- ranged at instantiation time.sibility of the template layer, which is a set of‘‘built-in’’ specializations of the entities of the The template mechanism we use is a substitutionmetamodel layer, specifically tailored for the most mechanism, based on macros, that facilitates the
  17. 17. ARTICLE IN PRESS508 P. Vassiliadis et al. / Information Systems 30 (2005) 492–525automatic creation of LDL code. This simple which returns the arity of the respective schema,notation and instantiation mechanism permits the mainly in order to define upper bounds in loopeasy and fast registration of LDL templates. In the of this section, we will elaborate on thenotation, instantiation mechanisms and template Loops: Loops are a powerful mechanism thattaxonomy particularities. enhances the genericity of the templates by allowing the designer to handle templates with3.2.1. Notation unknown number of variables and with unknown Our template notation is a simple language arity for the input/output schemata. The generalfeaturing five main mechanisms for dynamic form of loops isproduction of LDL expressions: (a) variables that ½hsimple constraintiŠfhloop bodyig;are replaced by their values at instantiationtime; (b) a function that returns the arity of an where simple constraint has the form:input, output or parameter schema; (c) loops, hlower boundi hcomparison operatori hiteratoriwhere the loop body is repeated at instantiation hcomparison operatori hupper boundi:time as many times as the iterator constraintdefines; (d) keywords to simplify the creation We consider only linear increase with step equalof unique predicate and attribute names; and, to 1, since this covers most possible cases. Upperfinally, (e) macros which are used as syntactic bound and lower bound can be arithmeticsugar to simplify the way we handle complex expressions involving arityOf() function calls,expressions (especially in the case of variable size variables and constants. Valid arithmetic opera-schemata). tors are +, À, /, * and valid comparison operators are o, 4, ¼ , all with their usual semantics. If Variables: We have two kinds of variables in the lower bound is omitted, 1 is assumed. During eachtemplate mechanism: parameter variables and loop iteration the loop body will be reproduced and atiterators. Parameter variables are marked with a @ the same time all the marked appearances of thesymbol at their beginning and they are replaced by loop iterator will be replaced by its current value,user-defined values at instantiation time. A list of as described before. Loop nesting is arbitrary length of parameters is denoted by@/parameter nameS[ ]. For such lists, the Keywords: Keywords are used in order to referuser has to explicitly or implicitly provide their to input and output schemata. They provide twolength at instantiation time. Loop iterators, on the main functionalities: (a) they simplify the referenceother hand, are implicitly defined in the loop to the input output/schema by using standardconstraint. During each loop iteration, all the names for the predicates and their attributes, andproperly marked appearances of the iterator in the (b) they allow their renaming at instantiation time.loop body are replaced by its current value This is done in such a way that no different(similarly to the way the C preprocessor treats predicates with the same name will appear in the#DEFINE statements). Iterators that appear same program, and no different attributes with themarked in loop body are instantiated even when same name will appear in the same rule. Keywordsthey are a part of another string or of a variable are recognized even if they are parts of anothername. We mark such appearances by enclosing string, without a special notation. This facilitates athem with $. This functionality enables referencing homogenous renaming of multiple distinct inputall the values of a parameter list and facilitates the schemata at template level, to multiple distinctcreation of an arbitrary number of pre-formatted schemata at instantiation, with all of them havingstrings. unique names in the LDL program scope. For example, if the template is expressed in terms of Functions: We employ a built-in function, ari- two different input schemata a_in1 and a_in2,tyOf(/input/output/parameter schemaS ), at instantiation time they will be renamed to
  18. 18. ARTICLE IN PRESS P. Vassiliadis et al. / Information Systems 30 (2005) 492–525 509 Keyword Usage Example A unique name for the output/input schema a_out of the activity. The predicate that is difference3_out produced when this template is instantiated a_in has the form: difference3_in unique_pred_name_out (or, _in respectively) A_OUT/A_IN is used for constructing the names of the a_out/a_in attributes. The names A_OUT DIFFERENCE3_OUT produced have the form: A_IN DIFFERENCE3_IN predicate unique name in upper case_OUT (or, _IN respectively) Fig. 11. Keywords for templates.dm1_in1 and dm1_in2 so that the produced definition of a template for a simple relationalnames will be unique throughout the scenario selection:program. In Fig. 11, we depict the way the a_out([ioarityOf(a_out)]{A_OUT_$i$,}renaming is performed at instantiation time. [i ¼ arityOf(a_out)]{A_OUT_$i$}) o- a_in1([ioarityOf(a_in1)] Macros: To make the definition of templates {A_IN1_$i$,} [i ¼ arityOf(a_in1)]easier and to improve their readability, we {A_IN1_$i$}),introduce a macro to facilitate attribute and expr([ioarityOf(@PARAM)]variable name expansion. For example, one of {@PARAM[$i$],}the major problems in defining a language for [i ¼ arityOf(@PARAM)]templates is the difficulty of dealing with schemata {@PARAM[$i$]}),of arbitrary arity. Clearly, at the template level, it [ioarityOf(a_out)]is not possible to pin-down the number of {A_OUT_$i$ ¼ A_IN1_$i$,}attributes of the involved schemata to a specific [i ¼ arityOf(a_out)]value. For example, in order to create a series of {A_OUT_$i$ ¼ A_IN1_$i$}names like the following As already mentioned at the syntax for loops, the name_theme_1,name_theme_2,y, expression name_theme_k [ioarityOf(a_out)]{A_OUT_$i$,}we need to give the following expression: [i ¼ arityOf(a_out)]{A_OUT_$i$} [iteratoromaxLimit] defining the attributes of the output schema {name_theme$iterator$} a_out simply wants to list a variable number of [iterator ¼ maxLimit] attributes that will be fixed at instantiation time. {name_theme$iterator$} Exactly the same tactics apply for the attributes of the predicate names a_in1 and expr. Also, the Obviously, this results in making the writing of final two lines state that each attribute of thetemplates hard and reduces their readability. To output will be equal to the respective attribute ofattack this problem, we resort to a simple reusable the input (so that the query is safe), e.g.,macro mechanism that enables the simplification A_OUT_4 ¼ A_IN1_4. We can simplify theof employed expressions. For example, observe the definition of the template by allowing the designer
  19. 19. ARTICLE IN PRESS510 P. Vassiliadis et al. / Information Systems 30 (2005) 492–525to define certain macros that simplify the manage- 4. All the rest parameter variables are instantiated.ment of temporary length attribute lists. We 5. Keywords are recognized and renamed.employ the following macros: We will try to explain briefly the intuition DEFINE INPUT_SCHEMA AS behind this execution order. Macros are expanded [ioarityOf(a_in1)]{A_IN1_$i$,} first. Step (2) proceeds step (3) because loop [i ¼ arityOf(a_in1)] {A_IN1_$i$} boundaries have to be calculated before loop DEFINE OUTPUT_SCHEMA AS productions are performed. Loops on the other [ioarityOf(a_in)]{A_OUT_$i$,} hand, have to be expanded before parameter [i ¼ arityOf(a_out)]{A_OUT_$i$} variables are instantiated, if we want to be able DEFINE PARAM_SCHEMA AS to reference lists of variables. The only exception [ioarityOf(@PARAM)]{@PARAM[$i$],} to this is the parameter variables that appear in the [i ¼ arityOf(@PARAM)]{@PARAM[$i$]} loop boundaries, which have to be calculated first. DEFINE DEFAULT_MAPPING AS Notice though, that variable list elements cannot [ioarityOf(a_out)] appear in the loop constraint. Finally, we have to {A_OUT_$i$ ¼ A_IN1_$i$,} instantiate variables before keywords since vari- [i ¼ arityOf(a_out)] ables are used to create a dynamic mapping {A_OUT_$i$ ¼ A_IN1_$i$} between the input/output schemata and other attributes. Then, the template definition is as follows: Fig. 12 shows a simple example of template a_out(OUTPUT_SCHEMA) o- instantiation for the function application activity. a_in1(INPUT_SCHEMA), To understand the overall process better, first expr(PARAM_SCHEMA), observe the outcome of it, i.e., the specific activity DEFAULT_MAPPING which is produced, as depicted in the final row of Fig. 12, labeled keyword renaming. The output schema of the activity, fa12_out, is the head of the LDL rule that specifies the activity. The body3.2.2. Instantiation of the rule says that the output records are Template instantiation is the process where the specified by the conjunction of the followinguser chooses a certain template and creates a clauses: (a) the input schema myFunc_in, (b)concrete activity out of it. This procedure requires the application of function subtract over thethat the user specifies the schemata of the activity attributes COST_IN, PRICE_IN and the produc-and gives concrete values to the template para- tion of a value PROFIT, and (c) the mapping ofmeters. Then, the process of producing therespective LDL description of the activity is easily the input to the respective output attributes as specified in the last three conjuncts of the rule.automated. Instantiation order is important in our The first row, template, shows the initialtemplate creation mechanism, since, as it can easily template as it has been registered by the designer.been seen from the notation definitions, different @FUNCTION holds the name of the function to beorders can lead to different results. The instantia- used, subtract in our case, and the @PARAM[ ]tion order is as follows: holds the inputs of the function, which in our case1. Replacement of macro definitions with their are the two attributes of the input schema. The expansions. problem we have to face is that all input, output2. arityOf() functions and parameter variables and function schemata have a variable number of appearing in loop boundaries are calculated parameters. To abstract from the complexity of first. this problem, we define four macro definitions, one3. Loop productions are performed by instantiat- for each schema (INPUT_SCHEMA, OUTPUT_ ing the appearances of the iterators. This leads SCHEMA, FUNCTION_INPUT) along with a macro to intermediate results without any loops. for the mapping of input to output attributes
  20. 20. ARTICLE IN PRESS P. Vassiliadis et al. / Information Systems 30 (2005) 492–525 511 Fig. 12. Instantiation procedure.(DEFAULT_MAPPING). The second row, macro also shown in the last two lines of the template. Inexpansion, shows how the template looks after the the third row, parameter instantiation, we can seemacros have been incorporated in the template how the parameter variables were materialized atdefinition. The mechanics of the expansion are instantiation. In the fourth row, loop production,straightforward: observe how the attributes of the we can see the intermediate results after the loopoutput schema are specified by the expression expansions are done. As it can easily be seen, these[ioarityOf(a_in)+1]{A_OUT_$i$,}OUT- expansions must be done before @PARAM[]FIELD as an expansion of the macro OUTPUT_ variables are replaced by their values. In the fifthSCHEMA. In a similar fashion, the attributes of the row, variable instantiation, the parameter variablesinput schema and the parameters of the function have been instantiated creating a default mappingare also specified; note that the expression for the between the input, the output and the functionlast attribute in the list is different (to avoid attributes. Finally, in the last row, keywordrepeating an erroneous comma). The mappings renaming, the output LDL code is presented afterbetween the input and the output attributes are the keywords are renamed. Keyword instantiation