In traditional data workflows, we often encounter what are known as "task-oriented" interfaces and tools. These interfaces focus primarily on orchestrating and managing data tasks or jobs. Engineers use task-oriented tools like Airflow, Luigi, and Prefect to model dependencies between jobs or tasks. This approach has been more about managing, monitoring, and operating jobs than about creating and managing data sets. However, a recent trend in the industry is a shift towards a more data-oriented approach: instead of concentrating on the jobs and tasks (processes), organizations are now emphasizing the data sets (end products) themselves. With this data-oriented approach, jobs and tasks are still essential, but they are treated as implementation details; they remain crucial in creating the data sets, but they are no longer the main focus. The primary focus is now on enabling data engineers to develop data sets (data models or assets) instead of jobs, and on ensuring that the data sets being generated are of high quality, easily accessible, and well-documented. The quick rise of tools like Mage.ai, Airbyte, and Dagster highlights this trend. The presentation will include a demonstration of various data-oriented tools, re-emphasizing the value of focusing on data rather than the tasks that manipulate it.
ETL Scenario: analyse review efficiency
How long does it take for a pull request to be merged? (Review efficiency)
Which components take the most time or the most comments to be merged?
ETL Scenario: analyse review efficiency
Step 1: Discover the data
Sources:
GitHub data from a GitHub repository using the GitHub API
o Data about Pull Requests
o Users
o Pull Request Comments
ETL Scenario: analyse review efficiency
Step 2: Integrate (ingest) the data
Sources:
GitHub data from a GitHub repository using the GitHub API
o Data about Pull Requests
o Users
o Pull Request Comments
ETL Scenario: analyse review efficiency
Step 2: Integrate (ingest) the data
Bookmark this page to try this at home: bit.ly/vdk-ingest
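For those who want to try the ingestion step at home, here is a minimal sketch of what such a step might look like as a Versatile Data Kit (VDK) Python job step. The repository name and destination table are placeholders, and the endpoint and fields follow the public GitHub REST API; the real demo job may be structured differently.

```python
# Hypothetical VDK data job step: pull closed PRs from the GitHub API
# and send each one for ingestion. Repo and table names are placeholders.
import requests

from vdk.api.job_input import IJobInput


def run(job_input: IJobInput):
    repo = "owner/repo"  # placeholder; set to your repository
    resp = requests.get(
        f"https://api.github.com/repos/{repo}/pulls",
        params={"state": "closed", "per_page": 100},
        timeout=30,
    )
    resp.raise_for_status()

    for pr in resp.json():
        # Send one record per pull request to the configured ingestion target.
        job_input.send_object_for_ingestion(
            payload={
                "number": pr["number"],
                "user": pr["user"]["login"],
                "created_at": pr["created_at"],
                "merged_at": pr["merged_at"],
            },
            destination_table="pull_requests",
        )
```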
ETL Scenario: analyse review efficiency
Step 3: Transform the data (dimensional data modeling)
Identify the Dimensions: Date, Component
Identify the Facts: Fact Table (fact_pr_merged)
Reminder:
How long does it take for a pull request to be merged? (Review efficiency)
How many comments does it take before a pull request is merged?
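To make the transform step concrete, here is a small pandas sketch that derives the two measures above and assembles a fact_pr_merged table keyed by merge date and component. Table and column names (including "component") are illustrative, not necessarily the schema used in the demo.

```python
import pandas as pd

# Illustrative sample of ingested data; the real ingested schema may differ.
pull_requests = pd.DataFrame({
    "pr_number": [1, 2],
    "component": ["core", "ui"],  # placeholder attribute, e.g. derived from labels
    "created_at": pd.to_datetime(["2023-05-01 10:00", "2023-05-02 09:00"]),
    "merged_at": pd.to_datetime(["2023-05-02 10:00", "2023-05-02 15:00"]),
})
pr_comments = pd.DataFrame({
    "pr_number": [1, 1, 2],
    "comment_id": [11, 12, 13],
})

# Measure 1: hours from open to merge.
pull_requests["hours_to_merge"] = (
    pull_requests["merged_at"] - pull_requests["created_at"]
).dt.total_seconds() / 3600

# Measure 2: number of review comments per pull request.
comment_counts = (
    pr_comments.groupby("pr_number").size().rename("num_comments").reset_index()
)

# fact_pr_merged: one row per merged PR, keyed by merge date and component.
fact_pr_merged = (
    pull_requests.merge(comment_counts, on="pr_number", how="left")
    .assign(
        date_key=lambda df: df["merged_at"].dt.date,
        num_comments=lambda df: df["num_comments"].fillna(0).astype(int),
    )[["pr_number", "date_key", "component", "hours_to_merge", "num_comments"]]
)
print(fact_pr_merged)
```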
In traditional data workflows, we often encounter what are known as "task-centric" interfaces and tools. These interfaces focus primarily on orchestrating and managing data tasks or jobs. The user is responsible for writing the code necessary for ingesting and transforming the data, as well as the code that orchestrates the series of tasks required to process that data.
But there is a shift in the data engineering ecosystem towards data-centric design, where data users can focus more on their primary goal – effectively using and managing data, and less on managing the minutiae of data jobs or pipelines. This approach can increase productivity and shift the focus towards ensuring the quality and utility of the resulting data sets. We will show this shift in this presentation.
We'll be delving into the world of data engineering and how we're witnessing a shift from task-centric to data-centric interfaces and design. But before we discuss this transition, let's begin with the foundation - the Data Engineering Cycle.
The Data Engineering Cycle forms the backbone of how we handle data in an organization. This process is cyclical and continuous, comprising several key stages:
Discover Data: This is always the first step, where we identify and understand the data that's available to us. We look for data both within and outside the organization that can provide valuable insights.
Integrate Data: Once we've identified relevant data, we bring it together, often from disparate sources, into a central repository or a data warehouse, or (in the case of a POC) onto our local machine.
Transform Data: This stage involves cleaning, structuring, enriching, and modeling the data. We apply various transformations to convert raw data into a format suitable for analysis and interpretation.
Deploy (or Publish) Data: After transformation, the data is made available to other systems, teams, or individuals who need it. It might be published in a data warehouse, a data mart, or exposed through APIs.
Manage Data: The final stage involves the ongoing management of data. This includes monitoring data flows, troubleshooting issues, ensuring data quality, and meeting performance standards.
In each of these stages, the role of a data engineer is critical. However, the way we approach these steps is changing, with a greater focus on the data itself rather than the tasks or jobs that manipulate it. And that's what we'll explore in the rest of this presentation. Stay tuned!
In traditional data workflows, we often encounter what are known as "task-centric" interfaces. These interfaces focus primarily on orchestrating and managing data tasks or jobs. The user, represented here, is responsible for writing the code necessary for ingesting and transforming the data, as well as orchestrating the series of tasks required to process that data.
The user writes code that interacts with an orchestration abstraction (or API). This could be things like data jobs, tasks, directed acyclic graphs (DAGs), or workflows. The user's goal is to orchestrate these tasks effectively to achieve the desired data transformation and analysis.
To better illustrate the task-centric approach, let's consider Apache Airflow as an example. In Airflow, workflows are defined as Directed Acyclic Graphs (or DAGs), which are sets of tasks executed in a specific order. Each node in the DAG represents a task, and the edges define dependencies amongst the tasks. Tasks are orchestrated and executed by the Airflow scheduler based on their dependencies.
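For illustration, a minimal Airflow DAG in this style might look like the following; the task names and bodies are placeholders rather than the actual pipeline from the demo.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull raw data from the GitHub API")


def transform():
    print("build the fact and dimension tables")


def load():
    print("publish the tables to the warehouse")


# The user describes tasks and their ordering; the Airflow scheduler
# decides when and where each task runs based on the dependencies.
with DAG(
    dag_id="review_efficiency",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```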
This task-centric interface often requires the user to spend considerable time and effort managing and monitoring these tasks and dependencies, instead of focusing on the data itself.
However, while this approach offers robust control over data processing, it necessitates a lot of manual work and a significant focus on task management, often detracting from the main goal of data users: effectively using and managing data.
Now, let's look at the shift towards the data-centric model. Here, the user's focus moves away from the details of task orchestration to a more high-level, often declarative, representation of the data workflows. The user writes code to describe the data transformations and their outcomes, focusing on the resultant datasets rather than the steps of their creation.
This code interacts with what we will call, for now, a Data API (data abstractions). The interfaces in this layer are centered around data-centric constructs like models, sources, destinations, templates, or assets, essentially representing the end products of data transformations.
Under the hood, the orchestration logic still exists, and tasks and jobs are still used. However, this complexity is abstracted away from the user, allowing them to focus on the data itself.
This transition exemplifies the move towards data-centric design, where data users can focus more on their primary goal – effectively using and managing data, and less on managing the minutiae of data jobs or pipelines. This approach can increase productivity and shift the focus towards ensuring the quality and utility of the resulting data sets.
The main interfaces of dbt are:
Models: These are SQL select statements that define transformations on your data. They represent the core logic for transforming raw data into a useful form for analysis. Models are stored as .sql files in the dbt project.
Sources: These are references to raw data in your warehouse. Instead of referencing the raw data by its table name, you reference a source defined in your dbt project. This provides a level of abstraction and allows dbt to perform additional checks on the raw data.
dbt uses the information from the models, sources, and potentially other configurations (like tests) to generate a directed acyclic graph (DAG), which is a representation of the order of operations (the orchestration logic). This is where dbt shines: based on the dependencies it infers from your SQL models, it knows which transformations depend on others, and it automatically runs those transformations in the correct order and builds them in an optimal way.
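To picture that ordering, here is a small Python illustration (not dbt's own code) of how a run order falls out of which models reference which, using the standard library's graphlib; the model names are hypothetical.

```python
from graphlib import TopologicalSorter

# Toy model graph: each model maps to the set of models it references.
# Staging models read from sources; fact/dim models read from staging models.
deps = {
    "stg_pull_requests": set(),
    "stg_pr_comments": set(),
    "dim_users": {"stg_pull_requests"},
    "fact_pr_merged": {"stg_pull_requests", "stg_pr_comments"},
}

# A topological sort yields a valid run order: staging first, then facts/dims.
print(list(TopologicalSorter(deps).static_order()))
# e.g. ['stg_pull_requests', 'stg_pr_comments', 'dim_users', 'fact_pr_merged']
```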
References:
- https://www.youtube.com/watch?v=Y03CsVDK69Y
- https://docs.getdbt.com/faqs/project/example-projects
Airflow: 1 - It's a task scheduler, mostly used for managing and scheduling complex workflows, making it task-centric.
Luigi: 1 - Similar to Airflow, Luigi is used for creating complex pipelines of batch jobs, putting it in the task-centric category.
Argo: 2 - Argo is a container-native workflow engine for Kubernetes that can be used to manage complex data workflows. Its ability to handle data directly is somewhat limited, though it allows storing data in a PVC and passing it to other tasks.
Prefect: 3 - Prefect improves on Airflow and Luigi for managing complex workflows with a Pythonic API. Tasks can exchange data, and the output of one task can be used as the input to another, as in the sketch after this list.
Dagster: 4 - Dagster introduced asset-oriented development, which builds on top of its earlier task-oriented approach. Interestingly, in 2021 Dagster would likely have been on the left, but the project has since focused a lot of effort on what it calls "software-defined assets".
Airbyte: 5 - Airbyte is an open-source data integration platform that syncs data from applications, APIs, and databases to data warehouses. Fivetran and Segment are similar.
dbt: 5 - dbt is primarily a data modeling tool. It focuses on transforming data inside your data warehouse, making it a very data-centric tool.
Great Expectations: 5 - It helps data teams eliminate pipeline debt through data testing, documentation, and profiling, so it's more data-centric.
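As a small illustration of the Prefect point above, here is a hypothetical Prefect 2-style flow where a task's output feeds directly into the next task; the task logic is a placeholder, not the demo pipeline.

```python
from prefect import flow, task


@task
def extract_prs():
    # Placeholder for a call to the GitHub API.
    return [{"number": 1, "merged": True}, {"number": 2, "merged": False}]


@task
def count_merged(prs):
    return sum(1 for pr in prs if pr["merged"])


@flow
def review_efficiency_flow():
    prs = extract_prs()          # the task's return value...
    merged = count_merged(prs)   # ...is passed straight into the next task
    print(f"merged PRs: {merged}")


if __name__ == "__main__":
    review_efficiency_flow()
```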
The first concept is data sources
Then you define your destinations
And finally you specify the flow from source to destination
Transaction Fact Tables for individual actions (e.g., FactPullRequests)
Accumulating Snapshot Fact Table, as we are interested in analyzing the lifecycle of pull requests from open to close.
Dimensions: Dimension date
Dimension users
Dagster's approach puts the emphasis on the data itself—what it represents (assets) and the instances of its creation or transformation (materializations). This makes Dagster particularly well-suited for scenarios where understanding and managing the data flow is more complex or more critical than the specific processing steps.
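A minimal sketch of this in Dagster's software-defined assets API (asset names and logic are illustrative): you declare assets and their upstream dependencies, and Dagster derives the execution order and records a materialization each time an asset is built.

```python
from dagster import Definitions, asset


@asset
def pull_requests():
    # Placeholder for ingesting PRs from the GitHub API.
    return [{"number": 1, "hours_to_merge": 26.0}]


@asset
def fact_pr_merged(pull_requests):
    # Downstream asset: Dagster infers the dependency from the parameter name.
    return [pr for pr in pull_requests if pr["hours_to_merge"] is not None]


defs = Definitions(assets=[pull_requests, fact_pr_merged])
```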
Data-centric interfaces like assets and materializations, sources, destinations, and flows offer a different perspective on building and managing data pipelines, focusing on the data and its lineage rather than just the processing steps. This approach can lead to clearer, more maintainable, and more robust data pipelines, especially in complex or data-intensive environments.