Data lineage and observability with OpenLineage
Julien Le Dem, CTO and Co-Founder, Datakin | May 2021
Observability for Data Pipelines With OpenLineage


Data is increasingly core to many products, whether to provide recommendations for users, gain insights into how they use the product, or improve the experience with machine learning. This creates a critical need for reliable data operations and for understanding how data flows through our systems. Data pipelines must be auditable, reliable, and run on time. This proves particularly difficult in a constantly changing, fast-paced environment.

Collecting lineage metadata while data pipelines are running reveals the dependencies between the many teams consuming and producing data, and how constant changes impact them. It is the underlying foundation that enables the many use cases related to data operations. The OpenLineage project is an API that standardizes this metadata across the ecosystem, reducing the complexity and duplicate work of collecting lineage information. It enables the many consumers of lineage in the ecosystem, whether they focus on operations, governance, or security.

Marquez is an open source project, part of the LF AI & Data Foundation, that instruments data pipelines to collect lineage and metadata and enable those use cases. It implements the OpenLineage API and provides context by making dependencies across organizations and technologies visible as they change over time.

1. Data lineage and observability with OpenLineage. Julien Le Dem, CTO and Co-Founder, Datakin | May 2021
2. AGENDA
● The need for metadata
● OpenLineage, the open standard for lineage collection, and Marquez, its reference implementation
● Spark observability with OpenLineage
3. The need for Metadata
4. Building a healthy data ecosystem (Team A, Team B, Team C)
5. Today: Limited context
● What is the data source?
● What is the schema?
● Who is the owner?
● How often is it updated?
● Where is it coming from?
● Who is using the data?
● What has changed?
6. Maslow’s Data hierarchy of needs (top to bottom): New Business Opportunities, Business Optimization, Data Quality, Data Freshness, Data Availability
7. OpenLineage
8. OpenLineage contributors: creators and contributors from major open source projects involved
9. Purpose: define an open standard for metadata and lineage collection by instrumenting data pipelines as they are running.
10. Purpose: EXIF for data pipelines
11. Problem
Before:
● Duplication of effort: each project has to instrument all jobs
● Integrations are external and can break with new versions
With OpenLineage:
● The effort of integration is shared
● Integration can be pushed into each project: no need to play catch-up
12. OpenLineage scope
In scope:
● The metadata and lineage collection standard
● Integrations (warehouses, schedulers, ...)
Not in scope:
● Backends (Kafka topic, graph DB, HTTP client, ...)
● Consumers (Kafka client, GraphDB client, ...)
13. Core Model:
● JSONSchema spec
● Consistent naming:
○ Jobs: scheduler.job.task
○ Datasets: instance.schema.table
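The naming conventions above can be illustrated with a small sketch; the scheduler, job, and dataset names here are hypothetical, not taken from the deck:

```python
# Sketch of the slide's naming conventions (all names hypothetical).
def job_name(scheduler: str, job: str, task: str) -> str:
    # Jobs are named scheduler.job.task
    return f"{scheduler}.{job}.{task}"

def dataset_name(instance: str, schema: str, table: str) -> str:
    # Datasets are named instance.schema.table
    return f"{instance}.{schema}.{table}"

print(job_name("airflow", "orders_etl", "aggregate"))    # airflow.orders_etl.aggregate
print(dataset_name("analytics_db", "public", "orders"))  # analytics_db.public.orders
```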
14. Protocol:
● Asynchronous events: a unique run id identifies a run and correlates events
● Configurable backend: Kafka, HTTP
Examples:
● Run Start event: source code version, run parameters
● Run Complete event: input dataset, output dataset version and schema
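The run lifecycle above can be sketched as two events sharing one run id. The field layout follows the OpenLineage core model (eventType, eventTime, run, job, inputs/outputs); the namespace and the job/dataset names are hypothetical:

```python
import json
import uuid
from datetime import datetime, timezone

def make_event(event_type: str, run_id: str, inputs=None, outputs=None) -> dict:
    """Build a minimal OpenLineage-style run event (hypothetical names)."""
    return {
        "eventType": event_type,  # e.g. START or COMPLETE
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},  # the shared id that correlates the events
        "job": {"namespace": "my-namespace", "name": "airflow.orders_etl.aggregate"},
        "inputs": inputs or [],
        "outputs": outputs or [],
    }

run_id = str(uuid.uuid4())  # one unique id for the whole run
start = make_event("START", run_id)
complete = make_event("COMPLETE", run_id,
                      outputs=[{"namespace": "my-namespace",
                                "name": "analytics_db.public.orders_daily"}])
print(json.dumps(start, indent=2))
```

Because both events carry the same `runId`, a backend receiving them asynchronously can still stitch them into a single run.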
15. Facets
● Extensible: facets are atomic pieces of metadata, identified by a unique name, that can be attached to the core entities.
● Decentralized: prefixes in facet names allow the definition of custom facets that can be promoted to the spec at a later point.
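As a sketch, a facet is just a named JSON object attached to an entity. The custom facet below uses a made-up `myOrg_` prefix to illustrate the decentralized naming; the exact layout of the schema facet is an assumption based on the slide, not copied from the spec:

```python
# A dataset entity with two facets attached (hypothetical content).
dataset = {
    "namespace": "my-namespace",
    "name": "analytics_db.public.orders",
    "facets": {
        # A standard facet: the dataset's schema.
        "schema": {
            "fields": [{"name": "order_id", "type": "BIGINT"},
                       {"name": "amount", "type": "DECIMAL"}]
        },
        # A custom facet: the "myOrg_" prefix marks it as org-specific,
        # so it could later be promoted to the spec without a name clash.
        "myOrg_dataQuality": {"nullCheckPassed": True, "rowCount": 14230},
    },
}
print(sorted(dataset["facets"]))  # ['myOrg_dataQuality', 'schema']
```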
16. Facet examples
● Dataset: stats, schema, version, column-level lineage
● Job: source code, dependencies, params, source control, query plan, query profile
● Run: schedule time, batch id
18. Metadata across ingest, storage, and compute
● Data platform built around Marquez
● Integrations: ingest, storage, compute (Kafka, Iceberg / S3, Flink for streaming, Airflow for batch/ML, BI), reporting through OpenLineage
19. Marquez: Data model
● Entities: Source, Dataset, Dataset Version, Job, Job Version, Run, with one-to-many relationships between them
● Sources: MySQL, PostgreSQL, Redshift, Snowflake, Kafka, S3, Iceberg, Delta Lake
● Types: batch, stream, service
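The relationships in the slide can be sketched with dataclasses. Only the one-to-many links shown on the slide are modeled, and all field and instance names are hypothetical:

```python
from dataclasses import dataclass, field
from typing import List

# Sketch of the Marquez model: a Source has many Datasets, a Dataset has
# many DatasetVersions, a Job has many JobVersions, and each Run of a
# JobVersion reads/writes DatasetVersions.

@dataclass
class DatasetVersion:
    dataset_name: str
    version: int

@dataclass
class Dataset:
    name: str
    source: str  # e.g. POSTGRESQL, KAFKA, S3, ...
    versions: List[DatasetVersion] = field(default_factory=list)

@dataclass
class Run:
    run_id: str
    inputs: List[DatasetVersion] = field(default_factory=list)
    outputs: List[DatasetVersion] = field(default_factory=list)

@dataclass
class JobVersion:
    version: int
    runs: List[Run] = field(default_factory=list)

@dataclass
class Job:
    name: str
    job_type: str  # BATCH, STREAM, or SERVICE
    versions: List[JobVersion] = field(default_factory=list)

orders = Dataset("analytics_db.public.orders", source="POSTGRESQL")
v1 = DatasetVersion(orders.name, 1)
orders.versions.append(v1)
job = Job("airflow.orders_etl.aggregate", job_type="BATCH",
          versions=[JobVersion(1, runs=[Run("run-001", outputs=[v1])])])
print(job.versions[0].runs[0].outputs[0].dataset_name)  # analytics_db.public.orders
```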
20. API
● OpenLineage and Marquez standardize metadata collection: job runs, parameters, version, inputs / outputs
● Datakin leverages Marquez metadata (lineage analysis, graph, integrations) to enable:
○ Understanding operational dependencies
○ Impact analysis
○ Troubleshooting: what has changed since the last time it worked?
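As a sketch of how collected events could reach a backend over HTTP: Marquez accepts OpenLineage events over its REST API (the endpoint path and local URL below are assumptions). The code only builds the request; the line that would actually send it is commented out:

```python
import json
import urllib.request
import uuid

# Hypothetical local Marquez instance; the endpoint path is an assumption.
MARQUEZ_URL = "http://localhost:5000/api/v1/lineage"

event = {
    "eventType": "COMPLETE",
    "eventTime": "2021-05-01T12:00:00Z",
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "my-namespace", "name": "airflow.orders_etl.aggregate"},
}

req = urllib.request.Request(
    MARQUEZ_URL,
    data=json.dumps(event).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req)  # would POST the event to Marquez
print(req.get_method(), req.full_url)
```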
21. Spark observability with OpenLineage
22. Spark java agent
spark.driver.extraJavaOptions:
-javaagent:marquez-spark-0.13.1.jar={argument}
23. Metadata collected
● Lineage: inputs / outputs
● Data volume: row count / byte size
● Logical plan
24. Lineage model
25. Lineage example across jobs
26. Example of OpenLineage metadata usage: data volume evolution
27. Join the conversation
OpenLineage:
● GitHub: github.com/OpenLineage
● Slack: OpenLineage.slack.com
● Twitter: @OpenLineage
● Email: groups.google.com/g/openlineage
Marquez:
● GitHub: github.com/MarquezProject/marquez
● Slack: MarquezProject.slack.com
● Twitter: @MarquezProject
