In traditional data workflows, we often encounter what are known as "task-oriented" interfaces and tools. These interfaces focus primarily on orchestrating and managing data tasks or jobs. Engineers use task-oriented tools like Airflow, Luigi, and Prefect to model dependencies between jobs or tasks. This approach has been more about managing, monitoring, and operating jobs than about creating and managing data sets. However, a recent trend in the industry is a shift towards a more data-oriented approach: instead of concentrating on the jobs and tasks (processes), organizations are now emphasizing the data sets (end products) themselves. With this data-oriented approach, jobs and tasks are still essential, but they are treated as implementation details; they remain crucial in creating the data sets, but they are no longer the main focus. The primary focus is now on enabling data engineers to develop data sets (data models or assets) instead of jobs, and on ensuring that the data sets being generated are of high quality, easily accessible, and well-documented. The quick rise of tools like Mage.ai, Airbyte, and Dagster highlights this trend. The presentation will include a demonstration of various data-oriented tools, re-emphasizing the value of focusing on data rather than the tasks that manipulate it.
ETL Scenario: analyse review efficiency
How long does it take for a pull request to be merged? (Review efficiency)
Which components take the most time or the most comments to be merged?
ETL Scenario: analyse review efficiency
Step 1: Discover the data
Sources:
GitHub data from a GitHub repository using the GitHub API
o Data about Pull Requests
o Users
o Pull Request Comments
ETL Scenario: analyse review efficiency
Step 2: Integrate (ingest) the data
Sources:
GitHub data from a GitHub repository using the GitHub API
o Data about Pull Requests
o Users
o Pull Request Comments
ETL Scenario: analyse review efficiency
Step 2: Integrate (ingest) the data
Bookmark this page to try this at home: bit.ly/vdk-ingest
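For those who want to try the ingestion step at home, here is a minimal sketch of what such a step might look like as a Versatile Data Kit (VDK) Python job step. The repository name and destination table are placeholders, and the endpoint and fields follow the public GitHub REST API; the real demo job may be structured differently.

```python
# Hypothetical VDK data job step: pull closed PRs from the GitHub API
# and send each one for ingestion. Repo and table names are placeholders.
import requests

from vdk.api.job_input import IJobInput


def run(job_input: IJobInput):
    repo = "owner/repo"  # placeholder; set to your repository
    resp = requests.get(
        f"https://api.github.com/repos/{repo}/pulls",
        params={"state": "closed", "per_page": 100},
        timeout=30,
    )
    resp.raise_for_status()

    for pr in resp.json():
        # Send one record per pull request to the configured ingestion target.
        job_input.send_object_for_ingestion(
            payload={
                "number": pr["number"],
                "user": pr["user"]["login"],
                "created_at": pr["created_at"],
                "merged_at": pr["merged_at"],
            },
            destination_table="pull_requests",
        )
```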
ETL Scenario: analyse review efficiency
Step 3: Transform the data (dimensional data modeling)
Identify the Dimensions: Date, Component
Identify the Facts: Fact Table (fact_pr_merged)
Reminder:
How long does it take for a pull request to be merged? (Review efficiency)
How many comments does it take before a pull request is merged?
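To make the transform step concrete, here is a small pandas sketch that derives the two measures above and assembles a fact_pr_merged table keyed by merge date and component. Table and column names (including "component") are illustrative, not necessarily the schema used in the demo.

```python
import pandas as pd

# Illustrative sample of ingested data; the real ingested schema may differ.
pull_requests = pd.DataFrame({
    "pr_number": [1, 2],
    "component": ["core", "ui"],  # placeholder attribute, e.g. derived from labels
    "created_at": pd.to_datetime(["2023-05-01 10:00", "2023-05-02 09:00"]),
    "merged_at": pd.to_datetime(["2023-05-02 10:00", "2023-05-02 15:00"]),
})
pr_comments = pd.DataFrame({
    "pr_number": [1, 1, 2],
    "comment_id": [11, 12, 13],
})

# Measure 1: hours from open to merge.
pull_requests["hours_to_merge"] = (
    pull_requests["merged_at"] - pull_requests["created_at"]
).dt.total_seconds() / 3600

# Measure 2: number of review comments per pull request.
comment_counts = (
    pr_comments.groupby("pr_number").size().rename("num_comments").reset_index()
)

# fact_pr_merged: one row per merged PR, keyed by merge date and component.
fact_pr_merged = (
    pull_requests.merge(comment_counts, on="pr_number", how="left")
    .assign(
        date_key=lambda df: df["merged_at"].dt.date,
        num_comments=lambda df: df["num_comments"].fillna(0).astype(int),
    )[["pr_number", "date_key", "component", "hours_to_merge", "num_comments"]]
)
print(fact_pr_merged)
```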
In traditional data workflows, we often encounter what are known as "task-centric" interfaces and tools. These interfaces focus primarily on orchestrating and managing data tasks or jobs. The user is responsible for writing the code necessary for ingesting and transforming the data, as well as the code that orchestrates the series of tasks required to process that data.
But there is a shift in the data engineering ecosystem towards data-centric design, where data users can focus more on their primary goal – effectively using and managing data, and less on managing the minutiae of data jobs or pipelines. This approach can increase productivity and shift the focus towards ensuring the quality and utility of the resulting data sets. We will show this shift in this presentation.
We'll be delving into the world of data engineering and how we're witnessing a shift from task-centric to data-centric interfaces and design. But before we discuss this transition, let's begin with the foundation - the Data Engineering Cycle.
The Data Engineering Cycle forms the backbone of how we handle data in an organization. This process is cyclical and continuous, comprising several key stages:
Discover Data: This is always the first step, where we identify and understand the data that's available to us. We look for data both within and outside the organization that can provide valuable insights.
Integrate Data: Once we've identified relevant data, we bring it together, often from disparate sources, into a central repository or a data warehouse, or (in the case of a POC) onto our local machine.
Transform Data: This stage involves cleaning, structuring, enriching, and modeling the data. We apply various transformations to convert raw data into a format suitable for analysis and interpretation.
Deploy (or Publish) Data: After transformation, the data is made available to other systems, teams, or individuals who need it. It might be published in a data warehouse, a data mart, or exposed through APIs.
Manage Data: The final stage involves the ongoing management of data. This includes monitoring data flows, troubleshooting issues, ensuring data quality, and meeting performance standards.
In each of these stages, the role of a data engineer is critical. However, the way we approach these steps is changing, with a greater focus on the data itself rather than the tasks or jobs that manipulate it. And that's what we'll explore in the rest of this presentation. Stay tuned!
In traditional data workflows, we often encounter what are known as "task-centric" interfaces. These interfaces focus primarily on orchestrating and managing data tasks or jobs. The user, represented here, is responsible for writing the code necessary for ingesting and transforming the data, as well as orchestrating the series of tasks required to process that data.
The user writes code that interacts with an orchestration abstraction (or API). This could be things like data jobs, tasks, directed acyclic graphs (DAGs), or workflows. The user's goal is to orchestrate these tasks effectively to achieve the desired data transformation and analysis.
To better illustrate the task-centric approach, let's consider Apache Airflow as an example. In Airflow, workflows are defined as Directed Acyclic Graphs (or DAGs), which are sets of tasks executed in a specific order. Each node in the DAG represents a task, and the edges define dependencies amongst the tasks. Tasks are orchestrated and executed by the Airflow scheduler based on their dependencies.
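For illustration, a minimal Airflow DAG in this style might look like the following; the task names and bodies are placeholders rather than the actual pipeline from the demo.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull raw data from the GitHub API")


def transform():
    print("build the fact and dimension tables")


def load():
    print("publish the tables to the warehouse")


# The user describes tasks and their ordering; the Airflow scheduler
# decides when and where each task runs based on the dependencies.
with DAG(
    dag_id="review_efficiency",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```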
This task-centric interface often requires the user to spend considerable time and effort managing and monitoring these tasks and dependencies, instead of focusing on the data itself.
However, while this approach offers robust control over data processing, it necessitates a lot of manual work and a significant focus on task management, often detracting from the main goal of data users: effectively using and managing data.
Now, let's look at the shift towards the data-centric model. Here, the user's focus moves away from the details of task orchestration to a more high-level, often declarative, representation of the data workflows. The user writes code to describe the data transformations and their outcomes, focusing on the resultant datasets rather than the steps of their creation.
This code interacts with what we will call, for now, a Data API (data abstractions). The interfaces in this layer are centered around data-centric constructs like models, sources, destinations, templates, or assets, essentially representing the end products of data transformations.
Under the hood, the orchestration logic still exists, and tasks and jobs are still used. However, this complexity is abstracted away from the user, allowing them to focus on the data itself.
This transition exemplifies the move towards data-centric design, where data users can focus more on their primary goal – effectively using and managing data, and less on managing the minutiae of data jobs or pipelines. This approach can increase productivity and shift the focus towards ensuring the quality and utility of the resulting data sets.
The main interfaces of dbt are:
Models: These are SQL select statements that define transformations on your data. They represent the core logic for transforming raw data into a useful form for analysis. Models are stored as .sql files in the dbt project.
Sources: These are references to raw data in your warehouse. Instead of referencing the raw data by its table name, you reference a source defined in your dbt project. This provides a level of abstraction and allows dbt to perform additional checks on the raw data.
dbt uses the information from the models, sources, and potentially other configurations (like tests) to generate a directed acyclic graph (DAG), which is a representation of the order of operations (the orchestration logic). This is where dbt shines: based on the dependencies it infers from your SQL models, it knows which transformations depend on others, and it automatically runs those transformations in the correct order and builds them in an optimal way.
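To picture that ordering, here is a small Python illustration (not dbt's own code) of how a run order falls out of which models reference which, using the standard library's graphlib; the model names are hypothetical.

```python
from graphlib import TopologicalSorter

# Toy model graph: each model maps to the set of models it references.
# Staging models read from sources; fact/dim models read from staging models.
deps = {
    "stg_pull_requests": set(),
    "stg_pr_comments": set(),
    "dim_users": {"stg_pull_requests"},
    "fact_pr_merged": {"stg_pull_requests", "stg_pr_comments"},
}

# A topological sort yields a valid run order: staging first, then facts/dims.
print(list(TopologicalSorter(deps).static_order()))
# e.g. ['stg_pull_requests', 'stg_pr_comments', 'dim_users', 'fact_pr_merged']
```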
References:
- https://www.youtube.com/watch?v=Y03CsVDK69Y
- https://docs.getdbt.com/faqs/project/example-projects
Airflow: 1 - It's a task scheduler, mostly used for managing and scheduling complex workflows, making it task-centric.
Luigi: 1 - Similar to Airflow, Luigi is used for creating complex pipelines of batch jobs, putting it in the task-centric category.
Argo: 2 - Argo is a container-native workflow engine for Kubernetes that can be used to manage complex data workflows. Its ability to handle data directly is somewhat limited, though it allows storing data in a PVC and passing it to other tasks.
Prefect: 3 - Prefect improves on Airflow and Luigi for managing complex workflows with a Pythonic API. Tasks can exchange data, and the output of one task can be used as the input to another, as in the sketch after this list.
Dagster: 4 - Dagster introduced asset-oriented development, which builds on top of its earlier task-oriented approach. Interestingly, in 2021 Dagster would likely have been on the left, but the project has since focused a lot of effort on what it calls "software-defined assets".
Airbyte: 5 - Airbyte is an open-source data integration platform that syncs data from applications, APIs, and databases to data warehouses. Fivetran and Segment are similar.
dbt: 5 - dbt is primarily a data modeling tool. It focuses on transforming data inside your data warehouse, making it a very data-centric tool.
Great Expectations: 5 - It helps data teams eliminate pipeline debt through data testing, documentation, and profiling, so it's more data-centric.
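As a small illustration of the Prefect point above, here is a hypothetical Prefect 2-style flow where a task's output feeds directly into the next task; the task logic is a placeholder, not the demo pipeline.

```python
from prefect import flow, task


@task
def extract_prs():
    # Placeholder for a call to the GitHub API.
    return [{"number": 1, "merged": True}, {"number": 2, "merged": False}]


@task
def count_merged(prs):
    return sum(1 for pr in prs if pr["merged"])


@flow
def review_efficiency_flow():
    prs = extract_prs()          # the task's return value...
    merged = count_merged(prs)   # ...is passed straight into the next task
    print(f"merged PRs: {merged}")


if __name__ == "__main__":
    review_efficiency_flow()
```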
The first concept is data sources
Then you define your destinations
And finally you specify the flow from source to destination
Transaction Fact Tables for individual actions (e.g., FactPullRequests)
Accumulating Snapshot Fact Table, as we are interested in analyzing the lifecycle of pull requests from open to close.
Dimensions: Dimension date
Dimension users
Dagster's approach puts the emphasis on the data itself—what it represents (assets) and the instances of its creation or transformation (materializations). This makes Dagster particularly well-suited for scenarios where understanding and managing the data flow is more complex or more critical than the specific processing steps.
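A minimal sketch of this in Dagster's software-defined assets API (asset names and logic are illustrative): you declare assets and their upstream dependencies, and Dagster derives the execution order and records a materialization each time an asset is built.

```python
from dagster import Definitions, asset


@asset
def pull_requests():
    # Placeholder for ingesting PRs from the GitHub API.
    return [{"number": 1, "hours_to_merge": 26.0}]


@asset
def fact_pr_merged(pull_requests):
    # Downstream asset: Dagster infers the dependency from the parameter name.
    return [pr for pr in pull_requests if pr["hours_to_merge"] is not None]


defs = Definitions(assets=[pull_requests, fact_pr_merged])
```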
Data-centric interfaces like assets and materializations, sources, destinations, and flows offer a different perspective on building and managing data pipelines, focusing on the data and its lineage rather than just the processing steps. This approach can lead to clearer, more maintainable, and more robust data pipelines, especially in complex or data-intensive environments.