This document presents a framework for designing ETL (extraction-transformation-loading) scenarios. The framework includes a metamodel for defining ETL activities and their relationships. It employs a declarative language to define the semantics of each activity. The framework is generic but includes predefined "templates" for common ETL activities to promote reusability. The design concepts have been implemented in a tool called ARKTOS II, which is also presented.
Databricks CEO Ali Ghodsi introduces Databricks Delta, a new data management system that combines the scale and cost-efficiency of a data lake, the performance and reliability of a data warehouse, and the low latency of streaming.
Making Data Timelier and More Reliable with Lakehouse Technology (Matei Zaharia)
Enterprise data architectures usually contain many systems—data lakes, message queues, and data warehouses—that data must pass through before it can be analyzed. Each transfer step between systems adds a delay and a potential source of errors. What if we could remove all these steps? In recent years, cloud storage and new open source systems have enabled a radically new architecture: the lakehouse, an ACID transactional layer over cloud storage that can provide streaming, management features, indexing, and high-performance access similar to a data warehouse. Thousands of organizations including the largest Internet companies are now using lakehouses to replace separate data lake, warehouse and streaming systems and deliver high-quality data faster internally. I’ll discuss the key trends and recent advances in this area based on Delta Lake, the most widely used open source lakehouse platform, which was developed at Databricks.
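To make the lakehouse idea concrete: below is a minimal PySpark sketch of an ACID table on plain file storage, using the open source delta-spark package. The path, schema, and sample rows are illustrative assumptions, not details from the talk.

```python
# Minimal Delta Lake sketch: an ACID table on ordinary file storage.
# Assumes the open source delta-spark package (pip install delta-spark);
# the table path and columns are illustrative.
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder.appName("lakehouse-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Writes are atomic: readers see either the old or the new table version.
events = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "action"])
events.write.format("delta").mode("append").save("/tmp/events")

# Time travel: read an earlier committed version of the same table.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events")
v0.show()
```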
The document discusses strategies for managing master data through a Master Data Management (MDM) solution. It outlines challenges with current data management practices and goals for an improved MDM approach. Key considerations for implementing an effective MDM strategy include identifying initial data domains, use cases, source systems, consumers, and the appropriate MDM patterns to address business needs.
Doug Bateman, a principal data engineering instructor at Databricks, presented on how to build a Lakehouse architecture. He began by introducing himself and his background. He then laid out the session's goals: describing key Lakehouse features, explaining how Delta Lake enables a Lakehouse, and developing a sample Lakehouse using Databricks. The key aspects of a Lakehouse are that it supports diverse data types and workloads while enabling the use of BI tools directly on source data. Delta Lake provides reliability, consistency, and performance through its ACID transactions, automatic file consolidation, and integration with Spark. Bateman concluded with a demo of creating a Lakehouse.
Introduction to SQL Analytics on Lakehouse Architecture (Databricks)
This document provides an introduction and overview of SQL Analytics on Lakehouse Architecture. It discusses the instructor Doug Bateman's background and experience. The course goals are outlined as describing key features of a data Lakehouse, explaining how Delta Lake enables a Lakehouse architecture, and defining features of the Databricks SQL Analytics user interface. The course agenda is then presented, covering topics on Lakehouse Architecture, Delta Lake, and a Databricks SQL Analytics demo. Background is also provided on Lakehouse architecture, how it combines the benefits of data warehouses and data lakes, and its key features.
This document discusses metadata and the importance of metadata management. It introduces Apache Atlas as an open source platform for metadata management and governance. Key points include:
- Metadata is important for data reuse, analytics, and governance. It provides context and meaning about data.
- Current reality is that metadata is often not well supported or integrated across tools. Apache Atlas aims to provide an open, unified approach.
- Apache Atlas has graduated to a top-level Apache project. It provides a type-agnostic metadata store and interfaces that can be accessed by various tools.
- The vision is for an open ecosystem where metadata is shared and federated across repositories from different vendors and tools.
Architect’s Open-Source Guide for a Data Mesh Architecture (Databricks)
Data Mesh is an innovative concept addressing many data challenges from an architectural, cultural, and organizational perspective. But is the world ready to implement Data Mesh?
In this session, we will review the importance of core Data Mesh principles, what they can offer, and when it is a good idea to try a Data Mesh architecture. We will discuss common challenges with implementation of Data Mesh systems and focus on the role of open-source projects for it. Projects like Apache Spark can play a key part in standardized infrastructure platform implementation of Data Mesh. We will examine the landscape of useful data engineering open-source projects to utilize in several areas of a Data Mesh system in practice, along with an architectural example. We will touch on what work (culture, tools, mindset) needs to be done to ensure Data Mesh is more accessible for engineers in the industry.
The audience will leave with a good understanding of the benefits of Data Mesh architecture, common challenges, and the role of Apache Spark and other open-source projects for its implementation in real systems.
This session is targeted for architects, decision-makers, data-engineers, and system designers.
Azure Databricks for Data Engineering by Eugene Polonichko (Dimko Zhluktenko)
This document provides an overview of Azure Databricks, an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud. It discusses key components of Azure Databricks including clusters, workspaces, notebooks, visualizations, jobs, alerts, and the Databricks File System. It also outlines how data engineers can leverage Azure Databricks for scenarios like running ETL pipelines, streaming analytics, and connecting business intelligence tools to query data.
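As a rough illustration of the ETL scenario described above, here is a hedged PySpark sketch that reads raw files, cleans them, and writes a curated table. The mount points, columns, and layout are invented for the example; Databricks notebooks provide the `spark` session.

```python
# Sketch of an ETL job a data engineer might run on an Azure Databricks
# cluster. Paths, column names, and the mount point are illustrative
# assumptions. `spark` is predefined in Databricks notebooks.
from pyspark.sql import functions as F

# Extract: read raw CSV landed in mounted cloud storage.
raw = (spark.read
       .option("header", "true")
       .csv("/mnt/landing/sales/2021/*.csv"))

# Transform: type the columns, drop bad rows, add a load date.
clean = (raw
         .withColumn("amount", F.col("amount").cast("double"))
         .dropna(subset=["order_id", "amount"])
         .withColumn("load_date", F.current_date()))

# Load: write a partitioned table for downstream BI queries.
(clean.write
 .mode("overwrite")
 .partitionBy("load_date")
 .parquet("/mnt/curated/sales"))
```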
Data mesh is a decentralized approach to managing and accessing analytical data at scale. It distributes responsibility for data pipelines and quality to domain experts. The key principles are domain-centric ownership, treating data as a product, and using a common self-service infrastructure platform. Snowflake is well-suited for implementing a data mesh with its capabilities for sharing data and functions securely across accounts and clouds, with built-in governance and a data marketplace for discovery. A data mesh implemented on Snowflake's data cloud can support truly global and multi-cloud data sharing and management according to data mesh principles.
The document discusses Azure Data Factory v2. It provides an agenda that includes topics like triggers, control flow, and executing SSIS packages in ADFv2. It then introduces the speaker, Stefan Kirner, who has over 15 years of experience with Microsoft BI tools. The rest of the document consists of slides on ADFv2 topics like the pipeline model, triggers, activities, integration runtimes, scaling SSIS packages, and notes from the field on using SSIS packages in ADFv2.
Part IV covers the content framework, architecture repository and metamodel. Part V discusses the Enterprise Continuum and Tools Requirements. Part VI covers two example reference models, while Part VII looks at the Enterprise Architecture Capability Framework.
Activate Data Governance Using the Data Catalog (DATAVERSITY)
This document discusses activating data governance using a data catalog. It compares active vs passive data governance, with active embedding governance into people's work through a catalog. The catalog plays a key role by allowing stewards to document definition, production, and usage of data in a centralized place. For governance to be effective, metadata from various sources must be consolidated and maintained in the catalog.
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D... (Databricks)
Many have dubbed the 2020s the decade of data. This is indeed an era of data zeitgeist.
From code-centric software development 1.0, we are entering software development 2.0, a data-centric and data-driven approach in which data plays a central role in our everyday lives.
As the volume and variety of data garnered from myriad sources continue to grow at an astronomical scale, and as cloud computing offers cheap compute and storage at scale, data platforms have to match in their ability to process, analyze, and visualize data at scale, at speed, and with ease. This involves paradigm shifts in how data is processed and stored, and in the programming frameworks that give developers access to these platforms.
In this talk, we will survey some emerging technologies that address the challenges of data at scale, how these tools help data scientists and machine learning developers with their data tasks, why they scale, and how they facilitate the future data scientists to start quickly.
In particular, we will examine in detail two open-source tools MLflow (for machine learning life cycle development) and Delta Lake (for reliable storage for structured and unstructured data).
Other emerging tools, such as Koalas, help data scientists do exploratory data analysis at scale in a language and framework they already know; we will also touch on emerging data + AI trends for 2021.
You will understand the challenges of machine learning model development at scale, why you need reliable and scalable storage, and what other open source tools are at your disposal to do data science and machine learning at scale.
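A minimal MLflow tracking sketch of the life-cycle idea mentioned above; the experiment name, model, and metric values are illustrative, but the calls are the standard MLflow tracking API.

```python
# Log parameters, metrics, and a model for one training run with MLflow.
# Experiment name and hyperparameters are made up for the example.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=0)

mlflow.set_experiment("scale-demo")
with mlflow.start_run():
    model = Ridge(alpha=0.5).fit(X, y)
    mlflow.log_param("alpha", 0.5)
    mlflow.log_metric("r2", r2_score(y, model.predict(X)))
    mlflow.sklearn.log_model(model, "model")  # reproducible model artifact
```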
Building the Data Lake with Azure Data Factory and Data Lake Analytics (Khalid Salama)
In essence, a data lake is a commodity distributed file system that acts as a repository for raw data extracts from all the enterprise source systems, so that it can serve the data management and analytics needs of the business. A data lake system provides the means to ingest data, perform scalable big data processing, and serve information, in addition to managing, monitoring, and securing the environment. In these slides, we discuss building data lakes using Azure Data Factory and Data Lake Analytics. We delve into the architecture of the data lake and explore its various components. We also describe the various data ingestion scenarios and considerations. We introduce the Azure Data Lake Store, then discuss how to build an Azure Data Factory pipeline to ingest data into the lake. After that, we move into big data processing using Data Lake Analytics and delve into U-SQL.
Real Time Data Processing using Spark Streaming | Data Day Texas 2015 (Cloudera, Inc.)
Speaker: Hari Shreedharan
Data Day Texas 2015
Apache Spark has emerged over the past year as the heir apparent to Hadoop MapReduce. Spark can process data in memory at very high speed while still being able to spill to disk if required. Spark's powerful yet flexible API allows users to write complex applications easily, without worrying about the internal workings or how the data gets processed on the cluster.
Spark comes with an extremely powerful Streaming API to process data as it is ingested. Spark Streaming integrates with popular data ingest systems like Apache Flume, Apache Kafka, and Amazon Kinesis, allowing users to process data as it comes in.
In this talk, Hari will discuss the basics of Spark Streaming, its API and its integration with Flume, Kafka and Kinesis. Hari will also discuss a real-world example of a Spark Streaming application, and how code can be shared between a Spark application and a Spark Streaming application. Each stage of the application execution will be presented, which can help understand practices while writing such an application. Hari will finally discuss how to write a custom application and a custom receiver to receive data from other systems.
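The talk covers the original DStream-based API; as a hedged sketch, the same Kafka-ingest pattern looks like this with today's Structured Streaming API. Broker and topic names are placeholders, and the spark-sql-kafka connector must be on the classpath.

```python
# Continuous word/key counting over a Kafka topic with Structured
# Streaming. Requires the spark-sql-kafka connector; broker address
# and topic name are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-stream-demo").getOrCreate()

# Read records from Kafka as an unbounded streaming DataFrame.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

# Count events per key as new data arrives.
counts = (events
          .select(F.col("key").cast("string"))
          .groupBy("key")
          .count())

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```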
First introduced with the Analytics Platform System (APS), PolyBase simplifies management and querying of both relational and non-relational data using T-SQL. It is now available in both Azure SQL Data Warehouse and SQL Server 2016. The major features of PolyBase include the ability to do ad-hoc queries on Hadoop data and the ability to import data from Hadoop and Azure blob storage to SQL Server for persistent storage. A major part of the presentation will be a demo on querying and creating data on HDFS (using Azure Blobs). Come see why PolyBase is the “glue” to creating federated data warehouse solutions where you can query data as it sits instead of having to move it all to one data platform.
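As a hedged sketch of the PolyBase pattern described above, the snippet below issues the external-table DDL from Python via pyodbc. The data source, file format, table schema, and connection string are all illustrative placeholders; a real setup for private storage also needs a database-scoped credential.

```python
# Rough sketch of the PolyBase pattern: expose files in Azure Blob
# storage as an external T-SQL table, then query them in place.
# All names, locations, and credentials are placeholders.
import pyodbc

ddl = """
CREATE EXTERNAL DATA SOURCE AzureBlob
WITH (TYPE = HADOOP,
      LOCATION = 'wasbs://data@myaccount.blob.core.windows.net');

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

CREATE EXTERNAL TABLE dbo.WebLogs (
    url NVARCHAR(400),
    hits INT
)
WITH (LOCATION = '/logs/',
      DATA_SOURCE = AzureBlob,
      FILE_FORMAT = CsvFormat)
"""

conn = pyodbc.connect("DSN=sqldw;UID=user;PWD=secret", autocommit=True)
cur = conn.cursor()
for stmt in ddl.split(";"):            # run each DDL statement separately
    if stmt.strip():
        cur.execute(stmt)

# Query the files where they sit -- no import step required.
for row in cur.execute("SELECT TOP 5 url, hits FROM dbo.WebLogs"):
    print(row.url, row.hits)
```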
Feature Store as a Data Foundation for Machine Learning (Provectus)
This document discusses feature stores and their role in modern machine learning infrastructure. It begins with an introduction and agenda. It then covers challenges with modern data platforms and emerging architectural shifts towards things like data meshes and feature stores. The remainder discusses what a feature store is, reference architectures, and recommendations for adopting feature stores including leveraging existing AWS services for storage, catalog, query, and more.
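Since the document stays at the architecture level, here is a toy Python sketch of the core feature-store idea: compute features once, key them by entity and timestamp, and store them where training and serving can both read them. The layout and names are assumptions, not any particular feature-store product.

```python
# Toy "offline feature store": features computed in one place and shared
# by training and serving. Requires pandas and pyarrow (for parquet).
import pandas as pd

orders = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "amount":  [10.0, 25.0, 5.0, 7.5, 30.0],
})

# Feature computation, done once instead of per-model.
features = (orders.groupby("user_id")
            .agg(order_count=("amount", "size"),
                 avg_amount=("amount", "mean"))
            .reset_index())
features["as_of"] = pd.Timestamp.now(tz="UTC")

# The shared store: a versioned, queryable file both pipelines read.
features.to_parquet("user_features.parquet", index=False)

# Training or serving code later does a point lookup by entity key.
lookup = pd.read_parquet("user_features.parquet")
print(lookup[lookup.user_id == 2])
```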
How to identify the correct Master Data subject areas & tooling for your MDM... (Christopher Bradley)
1. What are the different Master Data Management (MDM) architectures?
2. How can you identify the correct Master Data subject areas & tooling for your MDM initiative?
3. A reference architecture for MDM.
4. Selection criteria for MDM tooling.
chris.bradley@dmadvisors.co.uk
As public institutions rapidly adopt the cloud, Naver Cloud Platform's cloud services are gaining traction in the market. This session surveys Naver Cloud Platform's services, explains why cloud technology is drawing such attention, and highlights which cloud technologies public institutions should focus on.
The document provides an overview of the Extract, Transform, Load (ETL) process. It defines ETL as extracting data from databases, transforming the format or cleaning the data, and loading it into a data warehouse or data mart. It contrasts ETL tools, which move data between databases, with business intelligence (BI) tools, which allow querying and visualization of data. Key aspects of ETL covered include source-to-target mapping, data validation and quality checks, and testing approaches. Challenges and best practices for ETL are also discussed.
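The steps above map directly onto code. Below is a minimal, self-contained Python sketch of extract, transform (with a data quality check), and load, using sqlite3 so it runs anywhere; real pipelines would swap in their own source and warehouse connections.

```python
# Minimal end-to-end ETL: extract from a source database, validate and
# transform, load into a warehouse table. sqlite3 keeps it runnable.
import sqlite3

src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, "10.50"), (2, "bad"), (3, "7.25")])

# Extract
rows = src.execute("SELECT id, amount FROM orders").fetchall()

# Transform: cast amounts, reject rows that fail the quality check.
clean, rejected = [], []
for oid, amount in rows:
    try:
        clean.append((oid, float(amount)))
    except ValueError:
        rejected.append((oid, amount))   # kept for reconciliation

# Load
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE fact_orders (id INTEGER, amount REAL)")
dw.executemany("INSERT INTO fact_orders VALUES (?, ?)", clean)
dw.commit()
print(f"loaded={len(clean)} rejected={len(rejected)}")
```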
Databricks is a Software-as-a-Service-like experience (or Spark-as-a-service): a tool for curating and processing massive amounts of data, for developing, training, and deploying models on that data, and for managing the whole workflow throughout the project. It is for those who are comfortable with Apache Spark, as it is 100% based on Spark and is extensible with support for Scala, Java, R, and Python alongside Spark SQL, GraphX, Streaming, and the Machine Learning Library (MLlib). It has built-in integration with many data sources, a workflow scheduler, real-time workspace collaboration, and performance improvements over traditional Apache Spark.
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture (Kai Wähner)
Apache Kafka in conjunction with Apache Spark has become the de facto standard for processing and analyzing data. Both frameworks are open, flexible, and scalable.
Unfortunately, operating them is a challenge for many teams. Ideally, teams could use serverless SaaS offerings to focus on business logic. However, hybrid and multi-cloud scenarios require a cloud-native platform that provides automated and elastic tooling to reduce the operations burden.
This session explores different architectures to build serverless Apache Kafka and Apache Spark multi-cloud architectures across regions and continents.
We start from the analytics perspective of a data lake and explore its relation to a fully integrated data streaming layer with Kafka to build a modern Data Lakehouse.
Real-world use cases show the joint value and explore the benefit of the "delta lake" integration.
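As a hedged sketch of that integration, the snippet below streams Kafka events continuously into a Delta Lake table. Broker, topic, and paths are placeholders, and it assumes a Spark session (`spark`) with the Kafka connector and delta-spark available.

```python
# Land Kafka events continuously into a Delta Lake table.
# Assumes `spark` exists and the Kafka + Delta connectors are installed.
from pyspark.sql import functions as F

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "orders")
          .load()
          .select(F.col("value").cast("string").alias("payload"),
                  F.col("timestamp")))

(stream.writeStream
 .format("delta")
 .option("checkpointLocation", "/tmp/chk/orders")  # exactly-once bookkeeping
 .outputMode("append")
 .start("/tmp/delta/orders"))
```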
Embarking on building a modern data warehouse in the cloud can be an overwhelming experience due to the sheer number of products that can be used, especially when the use cases for many products overlap one another. In this talk I will cover the use cases of many of the Microsoft products you can use when building a modern data warehouse, broken down into four areas: ingest, store, prep, and model & serve. It's a complicated story that I will try to simplify, giving blunt opinions on when to use which products and the pros and cons of each.
Databricks: A Tool That Empowers You To Do More With Data (Databricks)
In this talk we will present how Databricks has enabled the author to achieve more with data: one person can build a coherent data project with data engineering, analysis, and science components, with better collaboration, better productionalization methods, larger datasets, and faster turnaround.
The talk will include a demo illustrating how the multiple functionalities of Databricks combine into one coherent data project: Databricks jobs, Delta Lake, and Auto Loader for data engineering; SQL Analytics for data analysis; Spark ML and MLflow for data science; and Projects for collaboration.
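For reference, the Auto Loader pattern mentioned above looks roughly like the following on Databricks. The `cloudFiles` source is Databricks-specific, `spark` comes from the notebook session, and the paths are placeholders.

```python
# Incrementally ingest files as they arrive in cloud storage using
# Databricks Auto Loader, landing them in a bronze Delta table.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/tmp/schemas/raw_events")
      .load("/mnt/landing/events/"))

(df.writeStream
 .format("delta")
 .option("checkpointLocation", "/tmp/chk/raw_events")
 .trigger(availableNow=True)          # process the backlog, then stop
 .start("/mnt/bronze/events"))
```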
Good systems development often depends on multiple data management disciplines, and metadata is one of them. Much of the discussion around metadata focuses on metadata itself and its associated technologies, but this tool-and-technology focus has not achieved significant results. A more relevant question when considering pockets of metadata is whether to include them in the scope of organizational metadata practices. By understanding metadata practices, you can begin to build systems that allow you to exercise sophisticated data management techniques and support business initiatives.
Learning Objectives:
How to leverage metadata in support of your business strategy
Understanding foundational metadata concepts based on the DAMA DMBOK
Guiding principles & lessons learned
This session covers topics related to data archiving and sharing. This includes data formats, metadata, controlled vocabularies, preservation, archiving and repositories.
Applying DevOps to Databricks can be a daunting task. In this talk it will be broken down into bite-size chunks. Common DevOps subject areas will be covered, including CI/CD (Continuous Integration/Continuous Deployment), IaC (Infrastructure as Code), and build agents.
We will explore how to apply DevOps to Databricks (in Azure), primarily using Azure DevOps tooling. As a lot of Spark/Databricks users are Python users, we will focus on the Databricks REST API (using Python) to perform our tasks.
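A minimal sketch of that approach: a Python step in a release pipeline that talks to the Databricks REST API. The workspace URL and token are assumed to come from pipeline variables, and the job ID is a placeholder.

```python
# Drive Databricks from a DevOps pipeline via the REST API.
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://adb-....azuredatabricks.net
token = os.environ["DATABRICKS_TOKEN"]  # injected by the build agent
headers = {"Authorization": f"Bearer {token}"}

# List clusters in the workspace (a cheap connectivity smoke test).
resp = requests.get(f"{host}/api/2.0/clusters/list", headers=headers)
resp.raise_for_status()
for cluster in resp.json().get("clusters", []):
    print(cluster["cluster_id"], cluster["state"])

# Trigger a deployed job as a release step (job ID is a placeholder).
run = requests.post(f"{host}/api/2.0/jobs/run-now",
                    headers=headers, json={"job_id": 1234})
run.raise_for_status()
print("started run", run.json()["run_id"])
```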
Azure Data Factory is a cloud data integration service that allows users to create data-driven workflows (pipelines) comprised of activities to move and transform data. Pipelines contain a series of interconnected activities that perform data extraction, transformation, and loading. Data Factory connects to various data sources using linked services and can execute pipelines on a schedule or on-demand to move data between cloud and on-premises data stores and platforms.
Gathering Business Requirements for Data Warehouses (David Walker)
This document provides an overview of the process for gathering business requirements for a data management and warehousing project. It discusses why requirements are gathered, the types of requirements needed, how business processes create data in the form of dimensions and measures, and how the gathered requirements will be used to design reports to meet business needs. A straw-man proposal is presented as a starting point for further discussion.
The document discusses employee spend management and provides an overview of Concur Travel and Expense. It outlines challenges with managing employee spending such as lack of visibility, incomplete data, and noncompliance. Concur offers an integrated spend management solution that provides controls, reduces costs, increases adoption of processes, and gives visibility into spending data. The solution is hosted, requires minimal IT involvement, and integrates with other systems.
An Overview on Data Quality Issues at Data Staging ETL (idescitation)
A data warehouse (DW) is a collection of technologies aimed at enabling the decision maker to make better and faster decisions. Data warehouses differ from operational databases in that they are subject oriented, integrated, time variant, non volatile, summarized, larger, not normalized, and perform OLAP. The generic data warehouse architecture consists of three layers (data sources, DSA, and primary data warehouse). During the ETL process, data is extracted from OLTP databases, transformed to match the data warehouse schema, and loaded into the data warehouse database.
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us (Bertram Ludäscher)
PWE: Datalog & ASP for the Rest of Us discusses using Possible Worlds Explorer (PWE) to make Datalog and Answer Set Programming (ASP) more accessible to non-experts. It covers topics like using provenance to explain query results, capturing rule firings to track provenance, representing provenance as a graph, using states to track derivation rounds, and declarative profiling of Datalog programs. The presentation advocates for tools like PWE that wrap Datalog/ASP engines to combine them with Python ecosystems and allow interactive use in Jupyter notebooks. This makes the languages more approachable and helps users build on existing work by experimenting further.
Intelligent agents in ontology-based applications (infopapers)
The document describes the development of an intelligent agent using JADE that is linked to a knowledge base system implemented with Protege and Algernon. The agent delivers useful information to users from the web or other agents based on their preferences. The knowledge base contains ontologies defined in Protege and facts that can be queried using if-then rules in Algernon. The example application was developed in Java Studio Creator to demonstrate an intelligent information agent.
This document introduces object-oriented programming (OOP). It discusses the software crisis and need for new approaches like OOP. The key concepts of OOP like objects, classes, encapsulation, inheritance and polymorphism are explained. Benefits of OOP like reusability, extensibility and managing complexity are outlined. Real-time systems, simulation, databases and AI are examples of promising applications of OOP. The document was presented by Prof. Dipak R Raut at International Institute of Information Technology, Pune.
Chapter 1 - Introduction to System Integration and Architecture.pdf (Khairul Anwar Sedek)
The document discusses system integration and architecture. It begins with basic definitions of key terms like system integration, enterprise application integration, and system architecture. It then covers common integration approaches like vertical integration, horizontal integration, and enterprise service buses. The document also discusses system architecture models and lifecycles. It explains the roles and responsibilities of systems architects and necessary skills and education to work in the field.
The document summarizes the Zachman Framework, an alternative approach to organizing systems development proposed by John Zachman in 1987. The Zachman Framework organizes systems development around different perspectives (rows), including strategic planning, business owner, architect, designer, builder, and functioning system. It also addresses different aspects (columns), including data, functions, network, people, time, and motivation. This provides a more comprehensive view than traditional system development life cycles, which focus only on data and functions and view development as a linear process. The Zachman Framework emphasizes understanding each perspective and starting system development by accurately capturing the business owner's view of how the business operates.
Configuration management for Lyee software (mariase324)
This document provides an overview of object oriented analysis and design using the Unified Modeling Language (UML). It discusses key concepts in object oriented programming like classes, objects, encapsulation, inheritance and polymorphism. It also outlines the software development lifecycle and phases like requirements analysis, design, coding, testing and maintenance. Finally, it introduces UML and explains how use case diagrams can be used to model the user view of a system by defining actors and use cases.
AN AI PLANNING APPROACH FOR GENERATING BIG DATA WORKFLOWS (gerogepatton)
The scale of big data causes the compositions of extract-transform-load (ETL) workflows to grow increasingly complex. With the turnaround time for delivering solutions becoming a greater emphasis, stakeholders cannot continue to afford to wait the hundreds of hours it takes for domain experts to manually compose a workflow solution. This paper describes a novel AI planning approach that facilitates rapid composition and maintenance of ETL workflows. The workflow engine is evaluated on real-world scenarios from an industrial partner and results gathered from a prototype are reported to demonstrate the validity of the approach.
The document presents the "4+1" view model for describing software architectures. It consists of five views: the logical view, process view, physical view, development view, and use case scenarios. Each view addresses different stakeholder concerns and can be described using its own notation. The logical view describes the object-oriented decomposition. The process view addresses concurrency and distribution. The physical view maps software to hardware. The development view describes module organization. Together these views provide a comprehensive architecture description that addresses multiple stakeholder needs.
This document provides an overview of software architectures and architectural structures. It discusses different types of architectural structures, including module structures, component-and-connector structures, and allocation structures. Module structures focus on modules and their relationships, component-and-connector structures examine runtime components and connectors, and allocation structures show how software elements map to environments. The document then examines specific architectural structures like modules, layers, classes, processes, repositories, and deployment. It emphasizes that an architect should focus on a few key structures like logical, process, development, and physical views to validate that the architecture meets requirements.
Is The Architectures Of The Convnets For Action... — Sheila Guy
The document discusses enterprise architecture (EA), which is defined as a conceptual blueprint that defines the structure and operation of an organization. The purpose of EA is to understand how an organization can most effectively achieve its current and future objectives. Some key benefits of EA include taking a holistic approach, ensuring consistency when delivering solutions to business problems, and aligning business and IT strategies to effectively use IT assets and support the organization's goals. EA is similar to city planning in providing an overall framework and context.
The document defines various elements of function point analysis, including:
1. File Types Referenced (FTRs), Internal Logical Files (ILFs), External Interface Files (EIFs), External Inputs (EI), External Outputs (EO), External Inquiries (EQ), and General System Characteristics (GSCs), which are the main components measured in a function point analysis.
2. It provides descriptions of each component: FTRs are files referenced by transactions, ILFs and EIFs are files stored internally or externally, EI involves data entering the system, EO is data exiting it, and EQ retrieves data without updates.
3. GSCs consider other factors, like architecture and performance, that adjust the final function point count.
This document summarizes a lecture on dealing with large-scale web data using large-scale file systems and MapReduce. It introduces MapReduce basics like its programming model and word count example. It also discusses large-scale file systems like Google File System (GFS), which stores data in chunks across multiple servers and provides replication for reliability. GFS assumptions include commodity hardware, high component failure rates, and large streaming reads over random access.
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA — cscpconf
In this paper we investigate the problem of providing scalability to the near-real-time ETL+Q (extract, transform, load and querying) process of data warehouses. In general, data loading, transformation and integration are heavy tasks that are performed only periodically during small fixed time windows. We propose an approach to enable the automatic scalability and freshness of any data warehouse and ETL+Q process for near-real-time BigData scenarios. A general framework for testing the proposed system was implemented, supporting parallelization solutions for each part of the ETL+Q pipeline. The results show that the proposed system is capable of handling scalability to provide the desired processing speed.
ETL design document
Information Systems 30 (2005) 492–525
www.elsevier.com/locate/infosys

A generic and customizable framework for the design of ETL scenarios

Panos Vassiliadis (a), Alkis Simitsis (b), Panos Georgantas (b), Manolis Terrovitis (b), Spiros Skiadopoulos (b)

(a) Department of Computer Science, University of Ioannina, Ioannina, Greece
(b) Department of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece
Abstract

Extraction–transformation–loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization and insertion into a data warehouse. In this paper, we delve into the logical design of ETL scenarios and provide a generic and customizable framework in order to support the DW designer in his task. First, we present a metamodel particularly customized for the definition of ETL activities. We follow a workflow-like approach, where the output of a certain activity can either be stored persistently or passed to a subsequent activity. Also, we employ a declarative database programming language, LDL, to define the semantics of each activity. The metamodel is generic enough to capture any possible ETL activity. Nevertheless, in the pursuit of higher reusability and flexibility, we specialize the set of our generic metamodel constructs with a palette of frequently used ETL activities, which we call templates. Moreover, in order to achieve a uniform extensibility mechanism for this library of built-ins, we have to deal with specific language issues. Therefore, we also discuss the mechanics of template instantiation to concrete activities. The design concepts that we introduce have been implemented in a tool, ARKTOS II, which is also presented.
© 2004 Elsevier B.V. All rights reserved.

Keywords: Data warehousing; ETL
E-mail addresses: pvassil@cs.uoi.gr (P. Vassiliadis), asimi@dbnet.ece.ntua.gr (A. Simitsis), pgeor@dbnet.ece.ntua.gr (P. Georgantas), mter@dbnet.ece.ntua.gr (M. Terrovitis), spiros@dbnet.ece.ntua.gr (S. Skiadopoulos).
doi:10.1016/j.is.2004.11.002

1. Introduction

Data warehouse operational processes normally compose a labor-intensive workflow, involving data extraction, transformation, integration, cleaning and transport. To deal with this workflow, specialized tools are already available in the market [1–4], under the general title Extraction–Transformation–Loading (ETL) tools. To give a general idea of the functionality of these tools we mention their most prominent tasks, which include (a) the identification of relevant information at the source side, (b) the extraction of this information,
[Fig. 1 contrasts two perspectives for an ETL workflow. Logical perspective: the execution plan (execution sequence, execution schedule, recovery plan), the relationship with data (primary data flow, data flow for logical exceptions) and the administration plan (monitoring & logging, security & access rights management). Physical perspective: the resource layer and the operational layer.]
Fig. 1. Different perspectives for an ETL workflow.
(c) the customization and integration of the information coming from multiple sources into a common format, (d) the cleaning of the resulting data set on the basis of database and business rules, and (e) the propagation of the data to the data warehouse and/or data marts.

If we treat an ETL scenario as a composite workflow, in a traditional way, its designer is obliged to define several of its parameters (Fig. 1). Here, we follow a multi-perspective approach that enables us to separate these parameters and study them in a principled manner. We are mainly interested in the design and administration parts of the lifecycle of the overall ETL process, and we depict them at the upper and lower part of Fig. 1, respectively. At the top of Fig. 1, we are mainly concerned with the static design artifacts for a workflow environment. We will follow a traditional approach and group the design artifacts into logical and physical, with each category comprising its own perspective. We depict the logical perspective on the left-hand side of Fig. 1, and the physical perspective on the right-hand side. At the logical perspective, we classify the design artifacts that give an abstract description of the workflow environment. First, the designer is responsible for defining an execution plan for the scenario. The definition of an execution plan can be seen from various perspectives. The execution sequence involves the specification of which activity runs first, second, and so on, which activities run in parallel, or when a semaphore is defined so that several activities are synchronized at a rendezvous point. ETL activities normally run in batch, so the designer needs to specify an execution schedule, i.e., the time points or events that trigger the execution of the scenario as a whole. Finally, due to system crashes, it is imperative that there exists a recovery plan, specifying the sequence of steps to be taken in the case of failure for a certain activity (e.g., retry to execute the activity, or undo any intermediate results produced so far). On the right-hand side of Fig. 1, we can also see the physical perspective, involving the registration of the actual entities that exist in the real world. We will reuse the terminology of [5] for the physical perspective. The resource layer comprises the definition of roles (human or software) that are responsible for executing the activities of the workflow. The operational layer, at the same time, comprises the software modules that implement the design entities of the logical perspective in the real world.
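To make these perspectives concrete, here is a minimal sketch (in Python, with hypothetical names such as ExecutionPlan and RecoveryAction; the paper prescribes no such API) of how the design parameters of Fig. 1 could be recorded for a scenario like the one discussed later:

from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List

class RecoveryAction(Enum):
    RETRY = "retry"   # re-execute the failed activity
    UNDO = "undo"     # discard any intermediate results produced so far

@dataclass
class ExecutionPlan:
    sequence: List[str]                  # total order of activities
    schedule: str                        # time points/events triggering the run
    recovery: Dict[str, RecoveryAction]  # per-activity failure policy

@dataclass
class AdministrationPlan:
    monitoring: bool = True              # on-line notification of the administrator
    logging: bool = True                 # off-line notification
    access_rights: Dict[str, List[str]] = field(default_factory=dict)

plan = ExecutionPlan(
    sequence=["FTP_PS1", "Diff_PS1", "NotNull1", "Add_Attr1", "SK1"],
    schedule="nightly batch window",
    recovery={"SK1": RecoveryAction.UNDO},
)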
In other words, the activities defined at the logical layer (in an abstract way) are materialized and executed through the specific software modules of the physical perspective.

At the lower part of Fig. 1, we are dealing with the tasks that concern the administration of the workflow environment and their dynamic behavior at runtime. First, an administration plan should be specified, involving the notification of the administrator either on-line (monitoring) or off-line (logging) for the status of an executed activity, as well as the security and authentication management for the ETL environment.

We find that research has not dealt with the definition of data-centric workflows to the entirety of its extent. In the ETL case, for example, due to the data-centric nature of the process, the designer must deal with the relationship of the involved activities with the underlying data. This involves the definition of a primary data flow that describes the route of data from the sources towards their final destination in the data warehouse, as they pass through the activities of the scenario. Also, due to possible quality problems of the processed data, the designer is obliged to define a data flow for logical exceptions, i.e., a flow for the problematic data, that is, the rows that violate integrity or business rules. It is the combination of the execution sequence and the data flow that generates the semantics of the ETL workflow: the data flow defines what each activity does and the execution plan defines in which order and combination.

In this paper, we work on the internals of the data flow of ETL scenarios. First, we present a metamodel particularly customized for the definition of ETL activities. We follow a workflow-like approach, where the output of a certain activity can either be stored persistently or passed to a subsequent activity. Moreover, we employ a declarative database programming language, LDL, to define the semantics of each activity. The metamodel is generic enough to capture any possible ETL activity; nevertheless, reusability and ease-of-use dictate that we can do better in aiding the data warehouse designer in his task. In this pursuit of higher reusability and flexibility, we specialize the set of our generic metamodel constructs with a palette of frequently used ETL activities, which we call templates. Moreover, in order to achieve a uniform extensibility mechanism for this library of built-ins, we have to deal with specific language issues: thus, we also discuss the mechanics of template instantiation to concrete activities. The design concepts that we introduce have been implemented in a tool, ARKTOS II, which is also presented.

Our contributions can be listed as follows:
- First, we define a formal metamodel as an abstraction of ETL processes at the logical level. The data stores, activities and their constituent parts are formally defined. An activity is defined as an entity with possibly more than one input schema, an output schema and a parameter schema, so that the activity is populated each time with its proper parameter values. The flow of data from producers towards their consumers is achieved through the usage of provider relationships that map the attributes of the former to the respective attributes of the latter. A serializable combination of ETL activities, provider relationships and data stores constitutes an ETL scenario.
- Second, we provide a reusability framework that complements the genericity of the metamodel. Practically, this is achieved through a set of "built-in" specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios. This palette of template activities will be referred to as the template layer and it is characterized by its extensibility; in fact, due to language considerations, we provide the details of the mechanism that instantiates templates to specific activities.
- Finally, we discuss implementation issues and we present a graphical tool, ARKTOS II, that facilitates the design of ETL scenarios, based on our model.

This paper is organized as follows. In Section 2, we present a generic model of ETL activities. Section 3 describes the mechanism for specifying and materializing template definitions of frequently used ETL activities. Section 4 presents ARKTOS II, a prototype graphical tool. In Section 5, we survey related work. In Section 6, we make a
general discussion on the completeness and general applicability of our approach. Section 7 offers conclusions and presents topics for future research. Short versions of parts of this paper have been presented in [6,7].

2. Generic model of ETL activities

The purpose of this section is to present a formal logical model for the activities of an ETL environment. This model abstracts from the technicalities of monitoring, scheduling and logging while it concentrates on the flow of data from the sources towards the data warehouse through the composition of activities and data stores. The full layout of an ETL scenario, involving activities, recordsets and functions, can be modeled by a graph, which we call the architecture graph. We employ a uniform, graph-modeling framework for both the modeling of the internal structure of activities and the ETL scenario at large, which enables the treatment of the ETL environment from different viewpoints. First, the architecture graph comprises all the activities and data stores of a scenario, along with their components. Second, the architecture graph captures the data flow within the ETL environment. Finally, the information on the typing of the involved entities and the regulation of the execution of a scenario, through specific parameters, is also covered.

2.1. Graphical notation and motivating example

Being a graph, the architecture graph of an ETL scenario comprises nodes and edges. The involved data types, function types, constants, attributes, activities, recordsets, parameters and functions constitute the nodes of the graph. The different kinds of relationships among these entities are modeled as the edges of the graph. In Fig. 2, we give the graphical notation for all the modeling constructs that will be presented in the sequel.

Motivating example: To motivate our discussion, we will present an example involving the propagation of data from a certain source S1, towards a data warehouse DW through intermediate recordsets. These recordsets belong to a data staging area (DSA)(1) DS. The scenario involves the propagation of data from the table PARTSUPP of source S1 to the data warehouse DW. Table DW.PARTSUPP (PKEY, SOURCE, DATE, QTY, COST) stores information for the available quantity (QTY) and cost (COST) of parts (PKEY) per source (SOURCE). The data source S1.PARTSUPP (PKEY, DATE, QTY, COST) records the supplies from a specific geographical region, e.g., Europe. All the attributes, except for the dates, are instances of the Integer type. The scenario is graphically depicted in Fig. 3 and involves the following transformations.

1. First, we transfer via FTP_PS1 the snapshot from the source S1.PARTSUPP to the file DS.PS1_NEW of the DSA.(2)
2. In the DSA, we maintain locally a copy of the snapshot of the source as it was at the previous loading (we assume here the case of the incremental maintenance of the DW, instead of the case of the initial loading of the DW). The recordset DS.PS1_NEW (PKEY, DATE, QTY, COST) stands for the last transferred snapshot of S1.PARTSUPP. By detecting the difference of this snapshot with the respective version of the previous loading, DS.PS1_OLD (PKEY, DATE, QTY, COST), we can derive the newly inserted rows in S1.PARTSUPP. Note that the difference activity that we employ, namely Diff_PS1, checks for differences only on the primary key of the recordsets; thus, we ignore here any possible deletions or updates for the attributes COST, QTY of existing rows. Any row that is not newly inserted is rejected and so, it is propagated to Diff_PS1_REJ, which stores all the rejected rows. The schema of Diff_PS1_REJ is identical to the input schema of the activity Diff_PS1.

(1) In data warehousing terminology a DSA is an intermediate area of the data warehouse, specifically destined to enable the transformation, cleaning and integration of source data, before being loaded to the warehouse.
(2) The technical points, like FTP, are mostly employed to show what kind of problems someone has to deal with in a practical situation, rather than to relate this kind of physical operations to a logical model. In terms of logical modelling this is a simple passing of data from one site to another.
Graphical notation:
- Data types: black ellipsoids. Function types: black rectangles. Constants: black circles. Attributes: unshaded ellipsoids.
- RecordSets: cylinders. Functions: gray rectangles. Parameters: white rectangles. Activities: triangles.
- Provider relationships: bold solid arrows (from provider to consumer).
- Part-of relationships: simple lines with diamond edges.*
- Instance-of relationships: dotted arrows (from instance towards the type).
- Derived provider relationships: bold dotted arrows (from provider to consumer).
- Regulator relationships: dotted lines.
* We annotate the part-of relationship among a function and its return type with a directed edge, to distinguish it from the rest of the parameters.
Fig. 2. Graphical notation for the architecture graph.
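The legend translates directly into a graph data structure. The following sketch (Python; an illustration of ours, not the ARKTOS II representation) enumerates the node and edge kinds and stores the graph as an edge list:

from enum import Enum, auto

class NodeKind(Enum):
    DATA_TYPE = auto()
    FUNCTION_TYPE = auto()
    CONSTANT = auto()
    ATTRIBUTE = auto()
    ACTIVITY = auto()
    RECORDSET = auto()
    PARAMETER = auto()
    FUNCTION = auto()

class EdgeKind(Enum):
    PROVIDER = auto()          # bold solid arrows
    PART_OF = auto()           # simple lines with diamond edges
    INSTANCE_OF = auto()       # dotted arrows
    REGULATOR = auto()         # dotted lines
    DERIVED_PROVIDER = auto()  # bold dotted arrows

# edge list: (from-node, to-node, kind)
architecture_graph = [
    ("DS.PS1.PKEY", "SK1.IN.PKEY", EdgeKind.PROVIDER),
    ("SK1.IN.PKEY", "SK1", EdgeKind.PART_OF),
    ("DS.PS1.PKEY", "Integer", EdgeKind.INSTANCE_OF),
]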
[Fig. 3 shows the flow from Source to Data Warehouse through the DSA: S1.PARTSUPP → (FTP_PS1) → DS.PS1_NEW → (Diff_PS1, against DS.PS1_OLD; rejects to Diff_PS1_REJ) → (NotNull1 on COST; rejects to NotNull1_REJ) → (Add_Attr1, SOURCE = 1) → DS.PS1 → (SK1, via the LOOKUP table with PKEY, SOURCE, SKEY) → DW.PARTSUPP.]
Fig. 3. Bird's-eye view of the motivating example.
3. The rows that pass the activity Diff_PS1 are checked for null values of the attribute COST through the activity NotNull1. Rows having a NULL value for their COST are kept in the NotNull1_REJ recordset for further examination by the data warehouse administrator.
4. Although we consider the data flow for only one source, namely S1, the data warehouse can
clearly have more sources for part supplies. In order to keep track of the source of each row entering the DW, we need to add a 'flag' attribute, namely SOURCE, indicating the respective source. This is achieved through the activity Add_Attr1. We store the rows that stem from this process in the recordset DS.PS1 (PKEY, SOURCE, DATE, QTY, COST).
5. Next, we assign a surrogate key on PKEY. In the data warehouse context, it is a common tactic to replace the keys of the production systems with a uniform key, which we call a surrogate key [8]. The basic reasons for this replacement are performance and semantic homogeneity. Textual attributes are not the best candidates for indexed keys and thus, they need to be replaced by integer keys. At the same time, different production systems might use different keys for the same object, or the same key for different objects, resulting in the need for a global replacement of these values in the data warehouse. This replacement is performed through a lookup table of the form L (PRODKEY, SOURCE, SKEY). The SOURCE column is due to the fact that there can be synonyms in the different sources, which are mapped to different objects in the data warehouse. In our case, the activity that performs the surrogate key assignment for the attribute PKEY is SK1. It uses the lookup table LOOKUP (PKEY, SOURCE, SKEY). Finally, we populate the data warehouse with the output of the previous activity.

The role of rejected rows depends on the peculiarities of each ETL scenario. If the designer needs to administrate these rows further, then he/she should use intermediate storage recordsets, with the burden of an extra I/O cost. If the rejected rows should not have a special treatment, then the best solution is to ignore them; thus, in this case we avoid overloading the scenario with any extra storage recordset. In our case, we annotate only two of the presented activities with a destination for rejected rows. Out of these, while NotNull1_REJ absolutely makes sense as a placeholder for problematic rows having non-acceptable NULL values, Diff_PS1_REJ is presented for demonstration reasons only.

Finally, before proceeding, we would like to stress that we do not anticipate a manual construction of the graph by the designer; rather, we employ this section to clarify how the graph will look once constructed. To assist a more automatic construction of ETL scenarios, we have implemented the ARKTOS II tool that supports the designing process through a friendly GUI. We present ARKTOS II in Section 4.

2.2. Preliminaries

In this subsection, we will introduce the formal modeling of data types, data stores and functions, before proceeding to the modeling of ETL activities.

Elementary entities: We assume the existence of a countable set of data types. Each data type T is characterized by a name and a domain, i.e., a countable set of values, called dom(T). The values of the domains are also referred to as constants.

We also assume the existence of a countable set of attributes, which constitute the most elementary granules of the infrastructure of the information system. Attributes are characterized by their name and data type. The domain of an attribute is a subset of the domain of its data type. Attributes and constants are uniformly referred to as terms.

A schema is a finite list of attributes. Each entity that is characterized by one or more schemata will be called a structured entity. Moreover, we assume the existence of a special family of schemata, all under the general name of NULL schema, determined to act as placeholders for data which are not to be stored permanently in some data store. We refer to a family instead of a single NULL schema, due to a subtle technicality involving the number of attributes of such a schema (this will become clear in the sequel).

Recordsets: We define a record as the instantiation of a schema to a list of values belonging to the domains of the respective schema attributes. We can treat any data structure as a recordset provided that there are ways to logically
restructure it into a flat, typed record schema. Formally, a recordset is characterized by its name, its (logical) schema and its (physical) extension (i.e., a finite set of records under the recordset schema). If we consider a schema S = [A1, …, Ak] for a certain recordset, its extension is a mapping S = [A1, …, Ak] → dom(A1) × … × dom(Ak). Thus, the extension of the recordset is a finite subset of dom(A1) × … × dom(Ak), and a record is the instance of a mapping dom(A1) × … × dom(Ak) → [x1, …, xk], xi ∈ dom(Ai).

In the rest of this paper we will mainly deal with the two most popular types of recordsets, namely relational tables and record files. A database is a finite set of relational tables.

Functions: We assume the existence of a countable set of built-in system function types. A function type comprises a name, a finite list of parameter data types, and a single return data type. A function is an instance of a function type. Consequently, it is characterized by a name, a list of input parameters and a parameter for its return value. The data types of the parameters of the generating function type also define (a) the data types of the parameters of the function and (b) the legal candidates for the function parameters (i.e., attributes or constants of a suitable data type).

2.3. Activities

Activities are the backbone of the structure of any information system. We adopt the WfMC terminology [9] for processes/programs and we will call them activities in the sequel. An activity is an amount of "work which is processed by a combination of resource and computer applications" [9]. In our framework, activities are logical abstractions representing parts or full modules of code.

The execution of an activity is performed by a particular program. Normally, ETL activities will either be performed in a black-box manner by a dedicated tool, or they will be expressed in some language (e.g., PL/SQL, Perl, C). Still, we want to deal with the general case of ETL activities. We employ an abstraction of the source code of an activity, in the form of an LDL statement. Using LDL, we avoid dealing with the peculiarities of a particular programming language. Once again, we want to stress that the presented LDL description is intended to capture the semantics of each activity, instead of the way these activities are actually implemented.

An elementary activity is formally described by the following elements:
- Name: A unique identifier for the activity.
- Input schemata: A finite set of one or more input schemata that receive data from the data providers of the activity.
- Output schema: A schema that describes the placeholder for the rows that pass the check performed by the elementary activity.
- Rejections schema: A schema that describes the placeholder for the rows that do not pass the check performed by the activity, or whose values are not appropriate for the performed transformation.
- Parameter list: A set of pairs which act as regulators for the functionality of the activity (the target attribute of a foreign key check, for example). The first component of the pair is a name and the second is a schema, an attribute, a function or a constant.
- Output operational semantics: An LDL statement describing the content passed to the output of the operation, with respect to its input. This LDL statement defines (a) the operation performed on the rows that pass through the activity and (b) an implicit mapping between the attributes of the input schema(ta) and the respective attributes of the output schema.
- Rejection operational semantics: An LDL statement describing the rejected records, in a sense similar to the output operational semantics. This statement is by default considered to be the complement of the output operational semantics, except if explicitly defined differently.

There are two issues that we would like to elaborate on here:

NULL schemata: Whenever we do not specify a data consumer for the output or rejection schemata, the respective NULL schema
(involving the correct number of attributes) is implied. This practically means that the data targeted for this schema will neither be stored to some persistent data store, nor will they be propagated to another activity; they will simply be ignored.

Language issues: Initially, we used to specify the semantics of activities with SQL statements. Still, although clear and easy to write and understand, SQL is rather hard to use if one is to perform rewriting and composition of statements. Thus, we have supplemented SQL with LDL [10], a logic-programming, declarative language, as the basis of our scenario definition. LDL is a Datalog variant based on Horn-clause logic that supports recursion, complex objects and negation. In the context of its implementation in an actual deductive database management system, LDL++ [11], the language has been extended to support external functions, choice, aggregation (and even user-defined aggregation), updates and several other features.

2.4. Relationships in the architecture graph

In this subsection, we will elaborate on the different kinds of relationships that the entities of an ETL scenario have. Whereas these entities are modeled as the nodes of the architecture graph, relationships are modeled as its edges. Due to their diversity, before proceeding, we list these types of relationships along with the related terminology that we will use in this paper. The graphical notation of entities (nodes) and relationships (edges) is presented in Fig. 2.
- Part-of relationships. These relationships involve attributes and parameters and relate them to the respective activity, recordset or function to which they belong.
- Instance-of relationships. These relationships are defined among a data/function type and its instances.
- Provider relationships. These are relationships that involve attributes with a provider–consumer relationship.
- Regulator relationships. These relationships are defined among the parameters of activities and the terms that populate these activities.
- Derived provider relationships. A special case of provider relationships that occurs whenever output attributes are computed through the composition of input attributes and parameters. Derived provider relationships can be deduced from a simple rule and do not originally constitute a part of the graph.

In the rest of this subsection, we will detail the notions pertaining to the relationships of the architecture graph; the knowledgeable reader is referred to Section 2.5 where we discuss the issue of scenarios. We will base our discussions on a part of the scenario of the motivating example (presented in Section 2.1), including activity SK1.
[Fig. 4 depicts the schemata of DS.PS1, SK1 and DW.PARTSUPP; all attributes except DATE point to the data type Integer, while DATE points to the data type Date.]
Fig. 4. Instance-of relationships of the architecture graph.
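As a small illustration of how instance-of edges carry typing information (a Python sketch of ours; attribute and type names follow Fig. 4, the helper function is hypothetical):

# instance-of edges: attribute -> data type, as in Fig. 4
attribute_type = {
    "DS.PS1.PKEY": "Integer",
    "DS.PS1.QTY": "Integer",
    "DS.PS1.COST": "Integer",
    "DS.PS1.DATE": "Date",
    "SK1.OUT.SKEY": "Integer",
}

def same_type(a: str, b: str) -> bool:
    """Check used by the static constraint of Section 2.5: provider
    mappings must relate terms of the same data type."""
    return attribute_type[a] == attribute_type[b]

assert same_type("DS.PS1.PKEY", "SK1.OUT.SKEY")
assert not same_type("DS.PS1.DATE", "DS.PS1.QTY")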
Data types and instance-of relationships: To capture typing information on attributes and functions, the architecture graph comprises data and function types. Instantiation relationships are depicted as dotted arrows that stem from the instances and head toward the data/function types. In Fig. 4, we observe the attributes of the two activities of our example and their correspondence to two data types, namely integer and date. For reasons of presentation, we merge several instantiation edges so that the figure does not become too crowded.

Attributes and part-of relationships: The first thing to incorporate in the architecture graph is the structured entities (activities and recordsets) along with all the attributes of their schemata. We choose to avoid overloading the notation by incorporating the schemata per se; instead we apply a direct part-of relationship between an activity node and the respective attributes. We annotate each such relationship with the name of the schema (by default, we assume an IN, OUT, PAR, REJ tag to denote whether the attribute belongs to the input, output, parameter or rejection schema of the activity, respectively). Naturally, if the activity involves more than one input schemata, the relationship is tagged with an INi tag for the ith input schema. We also incorporate the functions along with their respective parameters and the part-of relationships among the former and the latter. We annotate the part-of relationship with the return type with a directed edge, to distinguish it from the rest of the parameters.

[Fig. 5 depicts the decomposition of DS.PS1, SK1 and DW.PARTSUPP into their attributes, with SK1's PAR schema (PKEY, SOURCE, LPKEY, LSOURCE, LSKEY) linked to the attributes of its input schema and of the lookup table LOOKUP (PKEY, SOURCE, SKEY).]
Fig. 5. Part-of, regulator and provider relationships of the architecture graph.
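The decomposition shown in Fig. 5 can be read as plain data. A hedged sketch (Python; the Activity record is our own illustration, not the paper's formalism) of SK1 with its tagged schemata (part-of edges) and its parameter bindings (regulator edges):

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Activity:
    name: str
    schemata: Dict[str, List[str]]  # tag (IN/OUT/PAR) -> attributes (part-of)
    regulators: Dict[str, str]      # parameter -> populating term

sk1 = Activity(
    name="SK1",
    schemata={
        "IN": ["PKEY", "DATE", "QTY", "COST", "SOURCE"],
        "OUT": ["PKEY", "DATE", "QTY", "COST", "SOURCE", "SKEY"],
        "PAR": ["PKEY", "SOURCE", "LPKEY", "LSOURCE", "LSKEY"],
    },
    regulators={
        "PKEY": "SK1.IN.PKEY",       # production key to be replaced
        "SOURCE": "SK1.IN.SOURCE",   # which source's data are processed
        "LPKEY": "LOOKUP.PKEY",      # lookup column with production keys
        "LSOURCE": "LOOKUP.SOURCE",  # lookup column with source codes
        "LSKEY": "LOOKUP.SKEY",      # lookup column with surrogate keys
    },
)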
Fig. 5 depicts a part of the motivating example. In terms of part-of relationships, we present the decomposition of (a) the recordsets DS.PS1, LOOKUP and DW.PARTSUPP and (b) the activity SK1 and the attributes of its input and output schemata. Note the tagging of the schemata of the involved activity. We do not consider the rejection schemata in order to avoid crowding the picture. Also note how the parameters of the activity are incorporated in the architecture graph. Activity SK1 has five parameters: (a) PKEY, which stands for the production key to be replaced, (b) SOURCE, which stands for an integer value that characterizes which source's data are processed, (c) LPKEY, which stands for the attribute of the lookup table which contains the production keys, (d) LSOURCE, which stands for the attribute of the lookup table which contains the source value (corresponding to the aforementioned SOURCE parameter), and (e) LSKEY, which stands for the attribute of the lookup table which contains the surrogate keys.

Parameters and regulator relationships: Once the part-of and instantiation relationships have been established, it is time to establish the regulator relationships of the scenario. In this case, we link the parameters of the activities to the terms (attributes or constants) that populate them. We depict regulator relationships with simple dotted edges.

In the example of Fig. 5 we can also observe how the parameters of activity SK1 are populated through regulator relationships. All the parameters of SK1, namely PKEY, SOURCE, LPKEY, LSOURCE and LSKEY, are mapped to the respective attributes of either the activity's input schema or the employed lookup table LOOKUP. The parameter LSKEY deserves particular attention. This parameter is (a) populated from the attribute SKEY of the lookup table and (b) used to populate the attribute SKEY of the output schema of the activity. Thus, two regulator relationships are related with parameter LSKEY, one for each of the aforementioned attributes. The existence of a regulator relationship among a parameter and an output attribute of an activity normally denotes that some external data provider is employed in order to derive a new attribute, through the respective parameter.

Provider relationships: The flow of data from the data sources towards the data warehouse is performed through the composition of activities in a larger scenario. In this context, the input for an activity can be either a persistent data store or another activity. Usually, this applies for the output of an activity, too. We capture the passing of data from providers to consumers by a provider relationship among the attributes of the involved schemata.

Formally, a provider relationship is defined by the following elements:
- Name: A unique identifier for the provider relationship.
- Mapping: An ordered pair. The first part of the pair is a term (i.e., an attribute or constant) acting as a provider, and the second part is an attribute acting as the consumer.

The mapping need not necessarily be 1:1 from provider to consumer attributes, since an input attribute can be mapped to more than one consumer attribute. Still, the opposite does not hold. Note that a consumer attribute can also be populated by a constant, in certain cases.

In order to achieve the flow of data from the providers of an activity towards its consumers, we need the following three groups of provider relationships:
1. A mapping between the input schemata of the activity and the output schema of their data providers. In other words, for each attribute of an input schema of an activity, there must exist an attribute of the data provider, or a constant, which is mapped to the former attribute.
2. A mapping between the attributes of the activity input schemata and the activity output (or rejection, respectively) schema.
3. A mapping between the output or rejection schema of the activity and the (input) schema of its data consumer.

The mappings of the second type are internal to the activity. Basically, they can be derived from the LDL statement for each of the output/rejection schemata. As far as the first and the third types of provider relationships are concerned, the mappings must be provided during the construction of the ETL scenario. This means that they are either (a) by default assumed by the order of the attributes of the involved schemata or (b) hard-coded by the user. Provider relationships are depicted with bold solid arrows that stem from the provider and end in the consumer attribute.
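For activity SK1 of the running example, the three groups could look as follows (a Python sketch; the dictionaries map consumer attributes to their providers, a representation of ours rather than the paper's notation):

attrs = ["PKEY", "DATE", "QTY", "COST", "SOURCE"]

# Group 1: data provider (DS.PS1) -> input schema of SK1
group1 = {f"SK1.IN.{a}": f"DS.PS1.{a}" for a in attrs}

# Group 2: input schema -> output schema (internal; derivable from the
# activity's LDL statement)
group2 = {f"SK1.OUT.{a}": f"SK1.IN.{a}" for a in attrs}

# Group 3: output schema -> data consumer (DW.PARTSUPP); note that the
# warehouse key is fed by the surrogate key, not the production key
group3 = {f"DW.PARTSUPP.{a}": f"SK1.OUT.{a}" for a in attrs}
group3["DW.PARTSUPP.PKEY"] = "SK1.OUT.SKEY"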
Observe Fig. 5. The flow starts from table DS.PS1 of the data staging area. Each of the attributes of this table is mapped to an attribute of the input schema of activity SK1. The attributes of the input schema of the latter are subsequently mapped to the attributes of the output schema of the activity. The flow continues to DW.PARTSUPP. Another interesting thing is that during the data flow, new attributes are generated, resulting in new streams of data, whereas the flow seems to stop for other attributes. Observe the rightmost part of Fig. 5 where the values of attribute PKEY are not further propagated (remember that the reason for the application of a surrogate key transformation is to replace the production keys of the source data with a homogeneous surrogate for the records of the data warehouse, which is independent of the source they have been collected from). Instead of the values of the production key, the values from the attribute SKEY will be used to denote the unique identifier for a part in the rest of the flow.

In Fig. 6, we depict the LDL definition of this part of the motivating example. The three rules correspond to the three categories of provider relationships previously discussed: the first rule explains how the data from the DS.PS1 recordset are fed into the input schema of the activity, the second rule explains the semantics of the activity (i.e., how the surrogate key is generated) and, finally, the third rule shows how the DW.PARTSUPP recordset is populated from the output schema of the activity SK1.

Derived provider relationships: As we have already mentioned, there are certain output attributes that are computed through the composition of input attributes and parameters. A derived provider relationship is another form of provider relationship that captures the flow from the input to the respective output attributes.

Formally, assume that (a) source is a term in the architecture graph, (b) target is an attribute of the output schema of an activity A and (c) x, y are parameters in the parameter list of A (not necessarily different). Then, a derived provider relationship pr(source, target) exists iff the following regulator relationships (i.e., edges) exist: rr1(source, x) and rr2(y, target).
addSkey_in1(A_IN1_PKEY,A_IN1_DATE,A_IN1_QTY,A_IN1_COST,A_IN1_SOURCE) <-
    ds_ps1(A_OUT_PKEY,A_OUT_DATE,A_OUT_QTY,A_OUT_COST,A_OUT_SOURCE),
    A_OUT_PKEY=A_IN1_PKEY,
    A_OUT_DATE=A_IN1_DATE,
    A_OUT_QTY=A_IN1_QTY,
    A_OUT_COST=A_IN1_COST,
    A_OUT_SOURCE=A_IN1_SOURCE.

addSkey_out(A_OUT_PKEY,A_OUT_DATE,A_OUT_QTY,A_OUT_COST,A_OUT_SOURCE,A_OUT_SKEY) <-
    addSkey_in1(A_IN1_PKEY,A_IN1_DATE,A_IN1_QTY,A_IN1_COST,A_IN1_SOURCE),
    lookup(A_IN1_SOURCE,A_IN1_PKEY,A_OUT_SKEY),
    A_OUT_PKEY=A_IN1_PKEY,
    A_OUT_DATE=A_IN1_DATE,
    A_OUT_QTY=A_IN1_QTY,
    A_OUT_COST=A_IN1_COST,
    A_OUT_SOURCE=A_IN1_SOURCE.

dw_partsupp(PKEY,DATE,QTY,COST,SOURCE) <-
    addSkey_out(A_OUT_PKEY,A_OUT_DATE,A_OUT_QTY,A_OUT_COST,A_OUT_SOURCE,A_OUT_SKEY),
    DATE=A_OUT_DATE,
    QTY=A_OUT_QTY,
    COST=A_OUT_COST,
    SOURCE=A_OUT_SOURCE,
    PKEY=A_OUT_SKEY.

NOTE: For reasons of readability we do not replace the A in attribute names with the activity name; i.e., A_OUT_PKEY should be diffPS1_OUT_PKEY.
Fig. 6. LDL specification of the motivating example.
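To see what the three rules compute, here is a rough Python paraphrase of Fig. 6 (our own rendering over lists of dicts; the actual semantics of the activity live in the LDL above):

def add_skey(ds_ps1, lookup):
    """lookup maps (SOURCE, PKEY) -> SKEY, like lookup/3 in Fig. 6."""
    # Rule 1: feed DS.PS1 into the activity's input schema
    a_in1 = [dict(row) for row in ds_ps1]
    # Rule 2: compute the surrogate key through the lookup table
    a_out = [dict(row, SKEY=lookup[(row["SOURCE"], row["PKEY"])])
             for row in a_in1]
    # Rule 3: populate DW.PARTSUPP, with SKEY taking PKEY's place
    return [{"PKEY": r["SKEY"], "DATE": r["DATE"], "QTY": r["QTY"],
             "COST": r["COST"], "SOURCE": r["SOURCE"]} for r in a_out]

rows = add_skey(
    [{"PKEY": 7, "DATE": 20041101, "QTY": 5, "COST": 10, "SOURCE": 1}],
    lookup={(1, 7): 1001},
)
assert rows == [{"PKEY": 1001, "DATE": 20041101, "QTY": 5,
                 "COST": 10, "SOURCE": 1}]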
[Fig. 7 shows activity SK1 with its IN, OUT and PAR schemata and the lookup table LOOKUP: on the left, only the original provider and regulator relationships; on the right, the five derived provider relationships from the attributes populating the parameters (SK1.IN.PKEY, SK1.IN.SOURCE, LOOKUP.PKEY, LOOKUP.SOURCE, LOOKUP.SKEY) towards the computed output attribute SKEY.]
Fig. 7. Derived provider relationships of the architecture graph: the original situation on the left and the derived provider relationships on the right.
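The deduction rule lends itself to a few lines of code. A sketch (Python; function and variable names are ours) that recovers the five derived edges of Fig. 7:

def derived_providers(regulators, params, output_attrs):
    """pr(source, target) iff rr(source, x) and rr(y, target) exist,
    with x, y parameters of the activity."""
    sources = [s for (s, x) in regulators
               if x in params and s not in output_attrs]
    targets = [t for (y, t) in regulators
               if y in params and t in output_attrs]
    return {(s, t) for s in sources for t in targets}

params = {"PKEY", "SOURCE", "LPKEY", "LSOURCE", "LSKEY"}
regulators = [
    ("SK1.IN.PKEY", "PKEY"), ("SK1.IN.SOURCE", "SOURCE"),
    ("LOOKUP.PKEY", "LPKEY"), ("LOOKUP.SOURCE", "LSOURCE"),
    ("LOOKUP.SKEY", "LSKEY"), ("LSKEY", "SK1.OUT.SKEY"),
]
edges = derived_providers(regulators, params, {"SK1.OUT.SKEY"})
assert len(edges) == 5  # every populating attribute flows into SKEY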
Intuitively, the case of derived relationships models the situation where the activity computes a new attribute in its output. In this case, the produced output depends on all the attributes that populate the parameters of the activity, resulting in the definition of the corresponding derived relationship.

Observe Fig. 7, where we depict a small part of our running example. The left side of the figure depicts the situation where only provider relationships exist. The legend in the right side of Fig. 7 depicts how we compute the derived provider relationships between the parameters of the activity and the computed output attribute SKEY. The meaning of these five relationships is that SK1.OUT.SKEY is not computed only from attribute LOOKUP.SKEY, but from the combination of all the attributes that populate the parameters.

One can also assume different variations of derived provider relationships such as (a) relationships that do not involve constants (remember that we have defined source as a term); (b) relationships involving only attributes of the same/different activity (as a measure of internal complexity or external dependencies); (c) relationships relating attributes that populate only the same parameter (e.g., only the attributes LOOKUP.SKEY and SK1.OUT.SKEY).

2.5. Scenarios

A scenario is an enumeration of activities along with their source/target recordsets and the respective provider relationships for each activity. An ETL scenario consists of the following elements:
- Name: A unique identifier for the scenario.
- Activities: A finite list of activities. Note that by employing a list (instead of, e.g., a set) of activities, we impose a total ordering on the execution of the scenario.
Built-in entities (model-specific / scenario-specific):
- Data Types: D^I / D
- Function Types: F^I / F
- Constants: C^I / C
User-provided entities (model-specific / scenario-specific):
- Attributes: Ω^I / Ω
- Functions: Φ^I / Φ
- Schemata: S^I / S
- RecordSets: RS^I / RS
- Activities: A^I / A
- Provider Relationships: Pr^I / Pr
- Part-Of Relationships: Po^I / Po
- Instance-Of Relationships: Io^I / Io
- Regulator Relationships: Rr^I / Rr
- Derived Provider Relationships: Dr^I / Dr
Fig. 8. Formal definition of domains and notation.
- Recordsets: A finite set of recordsets.
- Targets: A special-purpose subset of the recordsets of the scenario, which includes the final destinations of the overall process (i.e., the data warehouse tables that must be populated by the activities of the scenario).
- Provider relationships: A finite list of provider relationships among activities and recordsets of the scenario.

In our modeling, a scenario is a set of activities, deployed along a graph in an execution sequence that can be linearly serialized. For the moment, we do not consider the different alternatives for the ordering of the execution; we simply require that a total order for this execution is present (i.e., each activity has a discrete execution priority).

In terms of formal modeling of the architecture graph, we assume the infinitely countable, mutually disjoint sets of names (i.e., the values of which respect the unique name assumption) of column model-specific in Fig. 8. As far as a specific scenario is concerned, we assume their respective finite subsets, depicted in column scenario-specific in Fig. 8. Data types, function types and constants are considered built-ins of the system, whereas the rest of the entities are provided by the user (user-provided).

Formally, the architecture graph of an ETL scenario is a graph G(V,E) defined as follows:
V = D ∪ F ∪ C ∪ Ω ∪ Φ ∪ S ∪ RS ∪ A
E = Pr ∪ Po ∪ Io ∪ Rr ∪ Dr

In the sequel, we treat the terms architecture graph and scenario interchangeably. The reasoning for the term 'architecture graph' goes all the way down to the fundamentals of conceptual modeling. As mentioned in [12], conceptual models are the means by which designers conceive, architect, design, and build software systems. These conceptual models are used in the same way that blueprints are used in other engineering disciplines during the early stages of the lifecycle of artificial systems, which involves the creation of their architecture. The term 'architecture graph' expresses the fact that the graph that we employ for the modeling of the data flow of the ETL scenario is practically acting as a blueprint of the architecture of this software artifact.

Moreover, we assume the following integrity constraints for a scenario:
Static constraints:
- All the weak entities of a scenario (i.e., attributes or parameters) should be defined within a part-of relationship (i.e., they should have a container object).
- All the mappings in provider relationships should be defined among terms (i.e., attributes or constants) of the same data type.
Data flow constraints:
- All the attributes of the input schema(ta) of an activity should have a provider.
- Resulting from the previous requirement, if some attribute is a parameter in an activity A, the container of the attribute (i.e., recordset or activity) should precede A in the scenario.
- All the attributes of the schemata of the target recordsets should have a data provider.

Summarizing, in this section, we have presented a generic model for the data flow of ETL workflows. In the next section, we will proceed to detail how this generic model can be accompanied by a customization mechanism, in order to provide higher flexibility to the designer of the workflow.

3. Templates for ETL activities

In this section, we present the mechanism for exploiting template definitions of frequently used ETL activities. The general framework for the exploitation of these templates is accompanied by the presentation of the language-related issues for template management and appropriate examples.

3.1. General framework

Our philosophy during the construction of our metamodel was based on two pillars: (a) genericity, i.e., the derivation of a simple model, powerful enough to capture ideally all the cases of ETL activities, and (b) extensibility, i.e., the possibility of extending the built-in functionality of the system with new, user-specific templates.

The genericity doctrine was pursued through the definition of a rather simple activity metamodel, as described in Section 2. Still, providing a single metaclass for all the possible activities of an ETL environment is not really enough for the designer of the overall process. A richer "language" should be available, in order to describe the structure of the process and facilitate its construction. To this end, we provide a palette of template activities, which are specializations of the generic metamodel class.

Observe Fig. 9 for a further explanation of our framework. The lower layer of Fig. 9, namely the schema layer, involves a specific ETL scenario. All the entities of the schema layer are instances of the classes Data Type, Function Type, Elementary Activity, RecordSet and Relationship.
[Fig. 9 shows three layers: the metamodel layer (Data Types, Functions, Elementary Activity, RecordSet, Relationships); the template layer (e.g., NotNull, Domain Mismatch, SK Assignment, Source Table, Fact Table, Provider Relationship), connected to the metamodel layer through IsA links; and the schema layer (S1.PARTSUPP, NN, DM1, SK1, DW.PARTSUPP), connected to the upper layers through InstanceOf links.]
Fig. 9. The metamodel for the logical entities of the ETL environment.
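In code, the two linkages of Fig. 9 map naturally onto subclassing (IsA) and object instantiation (InstanceOf). A hedged Python sketch of ours, with class names taken from the figure:

class ElementaryActivity:  # metamodel layer
    def __init__(self, name: str):
        self.name = name

# template layer: specializations (IsA) of the metamodel class
class NotNull(ElementaryActivity): pass
class DomainMismatch(ElementaryActivity): pass
class SKAssignment(ElementaryActivity): pass

# schema layer: a concrete scenario (InstanceOf links)
nn = NotNull("NN")
dm1 = DomainMismatch("DM1")
sk1 = SKAssignment("SK1")

# an instance of a template class is also an instance of the generic
# metamodel class: exactly the double instantiation described in the text
assert isinstance(sk1, SKAssignment) and isinstance(sk1, ElementaryActivity)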
Thus, as one can see on the upper part of Fig. 9, we introduce a meta-class layer, namely the metamodel layer, involving the aforementioned classes. The linkage between the metamodel and the schema layers is achieved through instantiation (InstanceOf) relationships. The metamodel layer implements the aforementioned genericity desideratum: the classes which are involved in the metamodel layer are generic enough to model any ETL scenario, through the appropriate instantiation.

Still, we can do better than the simple provision of a metalayer and an instance layer. In order to make our metamodel truly useful for practical cases of ETL activities, we enrich it with a set of ETL-specific constructs, which constitute a subset of the larger metamodel layer, namely the template layer. The constructs in the template layer are also meta-classes, but they are quite customized for the regular cases of ETL activities. Thus, the classes of the template layer are specializations (i.e., subclasses) of the generic classes of the metamodel layer (depicted as IsA relationships in Fig. 9). Through this customization mechanism, the designer can pick the instances of the schema layer from a much richer palette of constructs; in this setting, the entities of the schema layer are instantiations, not only of the respective classes of the metamodel layer, but also of their subclasses in the template layer.

In the example of Fig. 9 the concept DW.PARTSUPP must be populated from a certain source S1.PARTSUPP. Several operations must intervene during the propagation. For instance, in Fig. 9, we check for null values and domain violations, and we assign a surrogate key. As one can observe, the recordsets that take part in this scenario are instances of class RecordSet (belonging to the metamodel layer) and specifically of its subclasses Source Table and Fact Table. Instances and encompassing classes are related through links of type InstanceOf. The same mechanism applies to all the activities of the scenario, which are (a) instances of class Elementary Activity and (b) instances of one of its subclasses, depicted in Fig. 9. Relationships do not escape this rule either. For instance, observe how the provider links from the concept S1.PS toward the concept DW.PARTSUPP are related to class Provider Relationship through the appropriate InstanceOf links.

As far as the class RecordSet is concerned, in the template layer we can specialize it to several subclasses, based on orthogonal characteristics, such as whether it is a file or RDBMS table, or whether it is a source or target data store (as in Fig. 9). In the case of the class Relationship, there is a clear specialization in terms of the five classes of relationships which have already been mentioned in Section 2 (i.e., Provider, Part-Of, Instance-Of, Regulator and Derived Provider).
Filters:
- Selection (σ)
- Not null (NN)
- Primary key violation (PK)
- Foreign key violation (FK)
- Unique value (UN)
- Domain mismatch (DM)
Unary operations:
- Push
- Aggregation (γ)
- Projection (Π)
- Function application (f)
- Surrogate key assignment (SK)
- Tuple normalization (N)
- Tuple denormalization (DN)
Binary operations:
- Union (U)
- Join (⋈)
- Diff (Δ)
- Update detection (Δ_UPD)
File operations:
- EBCDIC to ASCII conversion (EB2AS)
- Sort file (Sort)
Transfer operations:
- Ftp (FTP)
- Compress/Decompress (Z/dZ)
- Encrypt/Decrypt (Cr/dCr)
Fig. 10. Template activities, along with their graphical notation symbols, grouped by category.
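Read as data, the palette of Fig. 10 is simply a registry of template names grouped by category (a Python sketch; the identifiers are ours):

TEMPLATES = {
    "filters": ["selection", "not_null", "primary_key_violation",
                "foreign_key_violation", "unique_value", "domain_mismatch"],
    "unary_operations": ["push", "aggregation", "projection",
                         "function_application", "surrogate_key_assignment",
                         "tuple_normalization", "tuple_denormalization"],
    "binary_operations": ["union", "join", "diff", "update_detection"],
    "file_operations": ["ebcdic_to_ascii", "sort_file"],
    "transfer_operations": ["ftp", "compress", "decompress",
                            "encrypt", "decrypt"],
}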
Following the same framework, class Elementary Activity is further specialized to an extensible set of recurring patterns of ETL activities, depicted in Fig. 10. As one can see on the top side of Fig. 9, we group the template activities in five major logical groups. We do not depict the grouping of activities in subclasses in Fig. 9, in order to avoid overloading the figure; instead, we depict the specialization of class Elementary Activity to three of its subclasses whose instances appear in the employed scenario of the schema layer. We now proceed to present each of the aforementioned groups in more detail.

The first group, named filters, provides checks for the satisfaction (or not) of a certain condition. The semantics of these filters are the obvious ones (starting from a generic selection condition and proceeding to the check for null values, primary or foreign key violations, etc.). The second group of template activities is called unary operations and, except for the most generic push activity (which simply propagates data from the provider to the consumer), consists of the classical aggregation and function application operations, along with three data warehouse-specific transformations (surrogate key assignment, normalization and denormalization). The third group consists of classical binary operations, such as union, join and difference of recordsets/activities, as well as a special case of difference involving the detection of updates. Apart from the aforementioned template activities, which mainly refer to logical transformations, we can also consider the case of physical operators that refer to the application of physical transformations to whole files/tables. In the ETL context, we are mainly interested in transfer operations (ftp, compress/decompress, encrypt/decrypt) and file operations (EBCDIC to ASCII conversion, sort file).

Summarizing, the metamodel layer is a set of generic entities, able to represent any ETL scenario. At the same time, the genericity of the metamodel layer is complemented by the extensibility of the template layer, which is a set of "built-in" specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios. Moreover, apart from this "built-in", ETL-specific extension of the generic metamodel, if the designer decides that several patterns not included in the palette of the template layer occur repeatedly in his data warehousing projects, he can easily fit them into the customizable template layer through a specialization mechanism.

3.2. Formal definition and usage of template activities

Once the template layer has been introduced, the obvious issue that arises is its linkage with the employed declarative language of our framework. In general, the broader issue is the usage of the template mechanism by the user; to this end, we will explain the substitution mechanism for templates in this subsection and refer the interested reader to [13] for a presentation of the specific templates that we have constructed.

A template activity is formally defined by the following elements:

- Name: A unique identifier for the template activity.
- Parameter list: A set of names which act as regulators in the expression of the semantics of the template activity. For example, the parameters are used to assign values to constants, create dynamic mappings at instantiation time, etc.
- Expression: A declarative statement describing the operation performed by the instances of the template activity. As with elementary activities, our model supports LDL as the formalism for the expression of this statement.
- Mapping: A set of bindings, mapping input to output attributes, possibly through intermediate placeholders. In general, mappings at the template level try to capture a default way of propagating incoming values from the input towards the output schema. These default bindings are easily refined and possibly rearranged at instantiation time.
The template mechanism we use is a substitution mechanism, based on macros, that facilitates the
automatic creation of LDL code. This simple notation and instantiation mechanism permits the easy and fast registration of LDL templates. In the rest of this section, we will elaborate on the notation, the instantiation mechanism and the particularities of the template taxonomy.

3.2.1. Notation
Our template notation is a simple language featuring five main mechanisms for the dynamic production of LDL expressions: (a) variables that are replaced by their values at instantiation time; (b) a function that returns the arity of an input, output or parameter schema; (c) loops, where the loop body is repeated at instantiation time as many times as the iterator constraint defines; (d) keywords to simplify the creation of unique predicate and attribute names; and, finally, (e) macros which are used as syntactic sugar to simplify the way we handle complex expressions (especially in the case of variable-size schemata).

Variables: We have two kinds of variables in the template mechanism: parameter variables and loop iterators. Parameter variables are marked with a @ symbol at their beginning and are replaced by user-defined values at instantiation time. A list of parameters of arbitrary length is denoted by @<parameter name>[ ]. For such lists, the user has to explicitly or implicitly provide their length at instantiation time. Loop iterators, on the other hand, are implicitly defined in the loop constraint. During each loop iteration, all the properly marked appearances of the iterator in the loop body are replaced by its current value (similarly to the way the C preprocessor treats #DEFINE statements). Iterators that appear marked in the loop body are instantiated even when they are part of another string or of a variable name. We mark such appearances by enclosing them with $. This functionality enables referencing all the values of a parameter list and facilitates the creation of an arbitrary number of pre-formatted strings.
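For instance, in the function application template discussed later (Fig. 12), the parameter variable @FUNCTION holds the name of the function and the parameter list @PARAM[ ] holds its inputs. A minimal sketch of their replacement at instantiation time (the template fragment and the attribute names are assumed here only for illustration; the full template appears in Fig. 12) is:

   template fragment:   @FUNCTION(@PARAM[1], @PARAM[2], OUTFIELD)
   after instantiation: subtract(COST_IN, PRICE_IN, PROFIT)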
Functions: We employ a built-in function, arityOf(<input/output/parameter schema>), which returns the arity of the respective schema, mainly in order to define upper bounds in loop iterators.

Loops: Loops are a powerful mechanism that enhances the genericity of the templates by allowing the designer to handle templates with an unknown number of variables and an unknown arity for the input/output schemata. The general form of loops is

   [<simple constraint>] {<loop body>}

where simple constraint has the form

   <lower bound> <comparison operator> <iterator> <comparison operator> <upper bound>

We consider only a linear increase with step equal to 1, since this covers most possible cases. The upper and lower bounds can be arithmetic expressions involving arityOf() function calls, variables and constants. Valid arithmetic operators are +, -, /, * and valid comparison operators are <, >, =, all with their usual semantics. If the lower bound is omitted, 1 is assumed. During each loop iteration the loop body is reproduced and, at the same time, all the marked appearances of the loop iterator are replaced by its current value, as described before. Loop nesting is permitted.
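To see the loop mechanism at work, consider the expansion of a loop over the attributes of an input schema. Assuming that arityOf(a_in1) evaluates to 3 at instantiation time (a value chosen here only for illustration), the expression

   [i<arityOf(a_in1)]{A_IN1_$i$,} [i=arityOf(a_in1)]{A_IN1_$i$}

produces the attribute list

   A_IN1_1,A_IN1_2,A_IN1_3

since the first loop emits the comma-terminated attributes for i=1,2 and the second emits the last attribute without a trailing comma.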
Keywords: Keywords are used in order to refer to input and output schemata. They provide two main functionalities: (a) they simplify the reference to the input/output schema by using standard names for the predicates and their attributes, and (b) they allow their renaming at instantiation time. This is done in such a way that no different predicates with the same name will appear in the same program, and no different attributes with the same name will appear in the same rule. Keywords are recognized even if they are parts of another string, without any special notation. This facilitates a homogeneous renaming of multiple distinct input schemata at the template level to multiple distinct schemata at instantiation time, with all of them having unique names in the LDL program scope. For example, if the template is expressed in terms of two different input schemata a_in1 and a_in2, at instantiation time they will be renamed to
dm1_in1 and dm1_in2, so that the produced names will be unique throughout the scenario program. In Fig. 11, we depict the way the renaming is performed at instantiation time.

Keyword: a_out / a_in
Usage: a unique name for the output/input schema of the activity. The predicate produced when this template is instantiated has the form unique_pred_name_out (or _in, respectively).
Example: difference3_out / difference3_in

Keyword: A_OUT / A_IN
Usage: used for constructing the names of the a_out/a_in attributes. The names produced have the form predicate unique name in upper case_OUT (or _IN, respectively).
Example: DIFFERENCE3_OUT / DIFFERENCE3_IN

Fig. 11. Keywords for templates.
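As a sketch of the renaming (a single-attribute, two-input rule is assumed here only for illustration; the actual difference template involves more conjuncts), a template rule of the form

   a_out(A_OUT_1) <- a_in1(A_IN1_1), a_in2(A_IN2_1)

would be renamed, for an activity instance named difference3, to

   difference3_out(DIFFERENCE3_OUT_1) <-
      difference3_in1(DIFFERENCE3_IN1_1),
      difference3_in2(DIFFERENCE3_IN2_1)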
Macros: To make the definition of templates easier and to improve their readability, we introduce a macro mechanism to facilitate attribute and variable name expansion. For example, one of the major problems in defining a language for templates is the difficulty of dealing with schemata of arbitrary arity. Clearly, at the template level, it is not possible to pin down the number of attributes of the involved schemata to a specific value. For example, in order to create a series of names like the following

   name_theme_1, name_theme_2, ..., name_theme_k

we need to give the following expression:

   [iterator<maxLimit]{name_theme$iterator$}
   [iterator=maxLimit]{name_theme$iterator$}

Obviously, this makes the writing of templates hard and reduces their readability. To attack this problem, we resort to a simple reusable macro mechanism that enables the simplification of the employed expressions. For example, observe the definition of a template for a simple relational selection:

   a_out([i<arityOf(a_out)]{A_OUT_$i$,}
         [i=arityOf(a_out)]{A_OUT_$i$}) <-
      a_in1([i<arityOf(a_in1)]{A_IN1_$i$,}
            [i=arityOf(a_in1)]{A_IN1_$i$}),
      expr([i<arityOf(@PARAM)]{@PARAM[$i$],}
           [i=arityOf(@PARAM)]{@PARAM[$i$]}),
      [i<arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$,}
      [i=arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$}

As already mentioned in the syntax for loops, the expression

   [i<arityOf(a_out)]{A_OUT_$i$,}
   [i=arityOf(a_out)]{A_OUT_$i$}

defining the attributes of the output schema a_out simply lists a variable number of attributes that will be fixed at instantiation time. Exactly the same tactics apply to the attributes of the predicates a_in1 and expr. Also, the final two lines state that each attribute of the output will be equal to the respective attribute of the input (so that the query is safe), e.g., A_OUT_4 = A_IN1_4. We can simplify the definition of the template by allowing the designer
to define certain macros that simplify the management of variable-length attribute lists. We employ the following macros:

   DEFINE INPUT_SCHEMA AS
      [i<arityOf(a_in1)]{A_IN1_$i$,}
      [i=arityOf(a_in1)]{A_IN1_$i$}

   DEFINE OUTPUT_SCHEMA AS
      [i<arityOf(a_out)]{A_OUT_$i$,}
      [i=arityOf(a_out)]{A_OUT_$i$}

   DEFINE PARAM_SCHEMA AS
      [i<arityOf(@PARAM)]{@PARAM[$i$],}
      [i=arityOf(@PARAM)]{@PARAM[$i$]}

   DEFINE DEFAULT_MAPPING AS
      [i<arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$,}
      [i=arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$}

Then, the template definition is as follows:

   a_out(OUTPUT_SCHEMA) <-
      a_in1(INPUT_SCHEMA),
      expr(PARAM_SCHEMA),
      DEFAULT_MAPPING
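To make the combined effect of the macros and the loops concrete, assume at instantiation time that a_out, a_in1 and @PARAM all have arity 2 (a value assumed here purely for illustration). Macro expansion followed by loop production then yields the loop-free intermediate form:

   a_out(A_OUT_1, A_OUT_2) <-
      a_in1(A_IN1_1, A_IN1_2),
      expr(@PARAM[1], @PARAM[2]),
      A_OUT_1=A_IN1_1,
      A_OUT_2=A_IN1_2

after which the @PARAM[ ] elements are replaced by their user-supplied values and the keywords a_out and a_in1 are renamed to unique predicate names.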
3.2.2. Instantiation
Template instantiation is the process by which the user chooses a certain template and creates a concrete activity out of it. This procedure requires that the user specifies the schemata of the activity and gives concrete values to the template parameters. Then, the process of producing the respective LDL description of the activity is easily automated. Instantiation order is important in our template creation mechanism since, as can easily be seen from the notation definitions, different orders can lead to different results. The instantiation order is as follows:

1. Macro definitions are replaced by their expansions.
2. arityOf() functions and parameter variables appearing in loop boundaries are calculated first.
3. Loop productions are performed by instantiating the appearances of the iterators. This leads to intermediate results without any loops.
4. All the remaining parameter variables are instantiated.
5. Keywords are recognized and renamed.

We briefly explain the intuition behind this execution order. Macros are expanded first. Step (2) precedes step (3) because loop boundaries have to be calculated before loop productions are performed. Loops, on the other hand, have to be expanded before parameter variables are instantiated, if we want to be able to reference lists of variables. The only exception concerns the parameter variables that appear in the loop boundaries, which have to be calculated first. Notice, though, that variable list elements cannot appear in the loop constraint. Finally, we have to instantiate variables before keywords, since variables are used to create a dynamic mapping between the input/output schemata and other attributes.

Fig. 12 shows a simple example of template instantiation for the function application activity. To understand the overall process better, first observe its outcome, i.e., the specific activity which is produced, as depicted in the final row of Fig. 12, labeled keyword renaming. The output schema of the activity, fa12_out, is the head of the LDL rule that specifies the activity. The body of the rule says that the output records are specified by the conjunction of the following clauses: (a) the input schema myFunc_in, (b) the application of the function subtract over the attributes COST_IN, PRICE_IN and the production of a value PROFIT, and (c) the mapping of the input to the respective output attributes, as specified in the last three conjuncts of the rule.
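Putting these clauses together, a plausible sketch of the final rule of Fig. 12 is the following (the output attribute names COST_OUT, PRICE_OUT and PROFIT_OUT are assumed here only for illustration; the exact names in the figure may differ):

   fa12_out(COST_OUT, PRICE_OUT, PROFIT_OUT) <-
      myFunc_in(COST_IN, PRICE_IN),
      subtract(COST_IN, PRICE_IN, PROFIT),
      COST_OUT=COST_IN,
      PRICE_OUT=PRICE_IN,
      PROFIT_OUT=PROFIT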
The first row, template, shows the initial template as it has been registered by the designer. @FUNCTION holds the name of the function to be used, subtract in our case, and @PARAM[ ] holds the inputs of the function, which in our case are the two attributes of the input schema. The problem we have to face is that all input, output and function schemata have a variable number of parameters. To abstract from the complexity of this problem, we define four macro definitions, one for each schema (INPUT_SCHEMA, OUTPUT_SCHEMA, FUNCTION_INPUT), along with a macro for the mapping of input to output attributes (DEFAULT_MAPPING).
Fig. 12. Instantiation procedure.
The second row, macro expansion, shows how the template looks after the macros have been incorporated in the template definition. The mechanics of the expansion are straightforward: observe how the attributes of the output schema are specified by the expression [i<arityOf(a_in)+1]{A_OUT_$i$,}OUTFIELD as an expansion of the macro OUTPUT_SCHEMA. In a similar fashion, the attributes of the input schema and the parameters of the function are also specified; note that the expression for the last attribute in the list is different (to avoid emitting an erroneous trailing comma). The mappings between the input and the output attributes are also shown in the last two lines of the template. In the third row, parameter instantiation, we can see how the parameter variables were materialized at instantiation. In the fourth row, loop production, we can see the intermediate results after the loop expansions are done. As can easily be seen, these expansions must be done before the @PARAM[ ] variables are replaced by their values. In the fifth row, variable instantiation, the parameter variables have been instantiated, creating a default mapping between the input, the output and the function attributes. Finally, in the last row, keyword renaming, the output LDL code is presented after the keywords have been renamed. Keyword instantiation