Data lineage and observability with OpenLineage
Julien Le Dem, CTO and Co-Founder, Datakin | May 2021
Observability for Data Pipelines With OpenLineage


Data is increasingly core to many products, whether to provide recommendations for users, gain insights into how they use the product, or improve the experience with machine learning. This creates a critical need for reliable data operations and for understanding how data flows through our systems. Data pipelines must be auditable, reliable, and run on time. This proves particularly difficult in a constantly changing, fast-paced environment.

Collecting lineage metadata while data pipelines are running reveals the dependencies between the many teams consuming and producing data, and how constant changes impact them. It is the underlying foundation that enables the many use cases related to data operations. The OpenLineage project is an API that standardizes this metadata across the ecosystem, reducing the complexity and duplicate work of collecting lineage information. It enables the many consumers of lineage in the ecosystem, whether they focus on operations, governance, or security.

Marquez is an open source project, part of the LF AI & Data Foundation, that instruments data pipelines to collect lineage and metadata and enable those use cases. It implements the OpenLineage API and provides context by making dependencies across organizations and technologies visible as they change over time.

1. Data lineage and observability with OpenLineage. Julien Le Dem, CTO and Co-Founder, Datakin | May 2021
2. AGENDA
● The need for metadata
● OpenLineage, the open standard for lineage collection, and Marquez, its reference implementation
● Spark observability with OpenLineage
3. The need for Metadata
4. Building a healthy data ecosystem (Team A, Team B, Team C)
5. Today: Limited context
● What is the data source?
● What is the schema?
● Who is the owner?
● How often is it updated?
● Where is it coming from?
● Who is using the data?
● What has changed?
6. Maslow’s Data hierarchy of needs (top to bottom): New Business Opportunities, Business Optimization, Data Quality, Data Freshness, Data Availability
7. OpenLineage
8. OpenLineage contributors: creators and contributors from major open source projects involved
9. Purpose: define an open standard for metadata and lineage collection by instrumenting data pipelines as they are running.
10. Purpose: EXIF for data pipelines
11. Problem
Before:
● Duplication of effort: each project has to instrument all jobs
● Integrations are external and can break with new versions
With OpenLineage:
● The effort of integration is shared
● Integration can be pushed into each project: no need to play catch-up
12. OpenLineage scope
In scope:
● The metadata and lineage collection standard
● Integrations (warehouses, schedulers, ...)
Not in scope:
● Backends (Kafka topic, graph DB, HTTP client, ...)
● Consumers (Kafka client, GraphDB client, ...)
13. Core Model:
● JSONSchema spec
● Consistent naming:
○ Jobs: scheduler.job.task
○ Datasets: instance.schema.table
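The naming conventions above can be illustrated with a small sketch; the scheduler, job, and dataset names here are hypothetical, not taken from the deck:

```python
# Sketch of the slide's naming conventions (all names hypothetical).
def job_name(scheduler: str, job: str, task: str) -> str:
    # Jobs are named scheduler.job.task
    return f"{scheduler}.{job}.{task}"

def dataset_name(instance: str, schema: str, table: str) -> str:
    # Datasets are named instance.schema.table
    return f"{instance}.{schema}.{table}"

print(job_name("airflow", "orders_etl", "aggregate"))    # airflow.orders_etl.aggregate
print(dataset_name("analytics_db", "public", "orders"))  # analytics_db.public.orders
```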
14. Protocol:
● Asynchronous events: a unique run id identifies a run and correlates events
● Configurable backend: Kafka, HTTP
Examples:
● Run Start event: source code version, run parameters
● Run Complete event: input dataset, output dataset version and schema
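The run lifecycle above can be sketched as two events sharing one run id. The field layout follows the OpenLineage core model (eventType, eventTime, run, job, inputs/outputs); the namespace and the job/dataset names are hypothetical:

```python
import json
import uuid
from datetime import datetime, timezone

def make_event(event_type: str, run_id: str, inputs=None, outputs=None) -> dict:
    """Build a minimal OpenLineage-style run event (hypothetical names)."""
    return {
        "eventType": event_type,  # e.g. START or COMPLETE
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},  # the shared id that correlates the events
        "job": {"namespace": "my-namespace", "name": "airflow.orders_etl.aggregate"},
        "inputs": inputs or [],
        "outputs": outputs or [],
    }

run_id = str(uuid.uuid4())  # one unique id for the whole run
start = make_event("START", run_id)
complete = make_event("COMPLETE", run_id,
                      outputs=[{"namespace": "my-namespace",
                                "name": "analytics_db.public.orders_daily"}])
print(json.dumps(start, indent=2))
```

Because both events carry the same `runId`, a backend receiving them asynchronously can still stitch them into a single run.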
15. Facets
● Extensible: facets are atomic pieces of metadata, identified by a unique name, that can be attached to the core entities.
● Decentralized: prefixes in facet names allow the definition of custom facets that can be promoted to the spec at a later point.
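As a sketch, a facet is just a named JSON object attached to an entity. The custom facet below uses a made-up `myOrg_` prefix to illustrate the decentralized naming; the exact layout of the schema facet is an assumption based on the slide, not copied from the spec:

```python
# A dataset entity with two facets attached (hypothetical content).
dataset = {
    "namespace": "my-namespace",
    "name": "analytics_db.public.orders",
    "facets": {
        # A standard facet: the dataset's schema.
        "schema": {
            "fields": [{"name": "order_id", "type": "BIGINT"},
                       {"name": "amount", "type": "DECIMAL"}]
        },
        # A custom facet: the "myOrg_" prefix marks it as org-specific,
        # so it could later be promoted to the spec without a name clash.
        "myOrg_dataQuality": {"nullCheckPassed": True, "rowCount": 14230},
    },
}
print(sorted(dataset["facets"]))  # ['myOrg_dataQuality', 'schema']
```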
16. Facet examples
● Dataset: stats, schema, version, column-level lineage
● Job: source code, dependencies, params, source control, query plan, query profile
● Run: schedule time, batch id
18. Metadata across ingest, storage, and compute
● Data platform built around Marquez
● Integrations: ingest, storage, compute (Kafka, Iceberg / S3, Flink for streaming, Airflow for batch/ML, BI), reporting through OpenLineage
19. Marquez: Data model
● Entities: Source, Dataset, Dataset Version, Job, Job Version, Run, with one-to-many relationships between them
● Sources: MySQL, PostgreSQL, Redshift, Snowflake, Kafka, S3, Iceberg, Delta Lake
● Types: batch, stream, service
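The relationships in the slide can be sketched with dataclasses. Only the one-to-many links shown on the slide are modeled, and all field and instance names are hypothetical:

```python
from dataclasses import dataclass, field
from typing import List

# Sketch of the Marquez model: a Source has many Datasets, a Dataset has
# many DatasetVersions, a Job has many JobVersions, and each Run of a
# JobVersion reads/writes DatasetVersions.

@dataclass
class DatasetVersion:
    dataset_name: str
    version: int

@dataclass
class Dataset:
    name: str
    source: str  # e.g. POSTGRESQL, KAFKA, S3, ...
    versions: List[DatasetVersion] = field(default_factory=list)

@dataclass
class Run:
    run_id: str
    inputs: List[DatasetVersion] = field(default_factory=list)
    outputs: List[DatasetVersion] = field(default_factory=list)

@dataclass
class JobVersion:
    version: int
    runs: List[Run] = field(default_factory=list)

@dataclass
class Job:
    name: str
    job_type: str  # BATCH, STREAM, or SERVICE
    versions: List[JobVersion] = field(default_factory=list)

orders = Dataset("analytics_db.public.orders", source="POSTGRESQL")
v1 = DatasetVersion(orders.name, 1)
orders.versions.append(v1)
job = Job("airflow.orders_etl.aggregate", job_type="BATCH",
          versions=[JobVersion(1, runs=[Run("run-001", outputs=[v1])])])
print(job.versions[0].runs[0].outputs[0].dataset_name)  # analytics_db.public.orders
```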
20. API
● OpenLineage and Marquez standardize metadata collection: job runs, parameters, version, inputs / outputs
● Datakin leverages Marquez metadata (lineage analysis, graph, integrations) to enable:
○ Understanding operational dependencies
○ Impact analysis
○ Troubleshooting: what has changed since the last time it worked?
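As a sketch of how collected events could reach a backend over HTTP: Marquez accepts OpenLineage events over its REST API (the endpoint path and local URL below are assumptions). The code only builds the request; the line that would actually send it is commented out:

```python
import json
import urllib.request
import uuid

# Hypothetical local Marquez instance; the endpoint path is an assumption.
MARQUEZ_URL = "http://localhost:5000/api/v1/lineage"

event = {
    "eventType": "COMPLETE",
    "eventTime": "2021-05-01T12:00:00Z",
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "my-namespace", "name": "airflow.orders_etl.aggregate"},
}

req = urllib.request.Request(
    MARQUEZ_URL,
    data=json.dumps(event).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req)  # would POST the event to Marquez
print(req.get_method(), req.full_url)
```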
21. Spark observability with OpenLineage
22. Spark java agent
spark.driver.extraJavaOptions:
-javaagent:marquez-spark-0.13.1.jar={argument}
23. Metadata collected
● Lineage: inputs / outputs
● Data volume: row count / byte size
● Logical plan
24. Lineage model
25. Lineage example across jobs
26. Example of OpenLineage metadata usage: data volume evolution
27. Join the conversation
OpenLineage:
● GitHub: github.com/OpenLineage
● Slack: OpenLineage.slack.com
● Twitter: @OpenLineage
● Email: groups.google.com/g/openlineage
Marquez:
● GitHub: github.com/MarquezProject/marquez
● Slack: MarquezProject.slack.com
● Twitter: @MarquezProject
