The document discusses the importance of metadata in data ecosystems and introduces OpenLineage, an open standard for lineage collection. It highlights the need for effective metadata management to avoid duplication and enhance collaboration across data pipelines, with an emphasis on integrations and observability in Spark. The document also outlines the core model and facets for metadata, detailing how OpenLineage and Marquez can facilitate understanding of data dependencies and support troubleshooting efforts.
Data lineage andobservability
with OpenLineage
Julien Le Dem, CTO and Co-Founder Datakin | Mai 2021
2.
AGENDA
● The needfor metadata
● OpenLineage - the open standard for lineage
collection - and Marquez, its reference
implementation
● Spark observability with OpenLineage
5
Today: Limited context
●What is the data source?
● What is the schema?
● Who is the owner?
● How often is it updated?
● Where is it coming from?
● Who is using the data?
● What has changed?
DATA
6.
Maslow’s Data hierarchyof needs
New Business Opportunities
Business Optimization
Data Quality
Data Freshness
Data Availability
Problem
Before:
● Duplication ofeffort: Each project
has to instrument all jobs
● Integrations are external and can
break with new versions
● Effort of integration is shared
● Integration can be pushed in
each project: no need to play
catch up
With Open Lineage
12.
Open Lineage scopeNot in scope
Backend
Integrations
Metadata
and
lineage
collection
standard
Warehouse
Schedulers
...
Kafka
topic
Graph
db
HTTP
client
Consumers
Kafka
client
GraphDB
client
...
14
Protocol:
- Asynchronous events:
Uniquerun id for identifying a
run and correlate events
- Configurable backend:
- Kafka
- Http
Examples:
● Run Start event
○ source code version
○ run parameters
● Run Complete event
○ input dataset
○ output dataset version and schema
15.
15
Facets
● Extensible:
Facets areatomic pieces of metadata
identified by a unique name that can be
attached to the core entities.
● Decentralized:
Prefixes in facet names allow the
definition of Custom facets that can be
promoted to the spec at a later point.
16.
Facet examples
Dataset:
- Stats
-Schema
- Version
- Column level
lineage
Job:
- Source code
- Dependencies
- params
- Source control
- Query plan
- Query profile
Run:
- Schedule time
- Batch id
Marquez: Data model
Job
DatasetJob Version
Run
*
1
*
1
*
1
1
*
1
*
Source
1 *
● MYSQL
● POSTGRESQL
● REDSHIFT
● SNOWFLAKE
● KAFKA
● S3
● ICEBERG
● DELTALAKE
● BATCH
● STREAM
● SERVICE
Dataset Version
20.
API
● Open Lineageand Marquez standardize
metadata collection
○ Job runs
○ Parameters
○ Version
○ Inputs / outputs
● Datakin enables
○ Understanding operational dependencies
○ Impact analysis
○ Troubleshooting: What has changed
since the last time it worked?
Datakin leverages Marquez metadata
Lineage analysis
Graph
Integrations