"OpenLineage is an open platform for the collection and analysis of data lineage, which includes an open standard for lineage data collection, integration libraries for the most common tools, and a metadata repository/reference implementation (Marquez).
In recent months, stream processing, an important use case for Apache Kafka, has gained particular focus in the OpenLineage community, with many useful features completed or underway, including:
* A seamless OpenLineage & Apache Flink integration,
* Support for streaming jobs in Marquez,
* Progress on a built-in lineage API within the Flink codebase.
Cross-platform lineage allows for a holistic overview of data flow and its dependencies within organizations, including stream processing.
This talk will provide an overview of the most recent developments in the OpenLineage Flink integration and share what’s in store for this important collaboration.
This talk is a must-attend for those wishing to stay up-to-date on lineage developments in the stream processing world."
OpenLineage for Stream Processing | Kafka Summit London
1. OpenLineage For Stream Processing
Paweł Leszczyński (github pawel-big-lebowski)
Maciej Obuchowski (github mobuchowski)
Kafka Summit 2024
2. Agenda
● OpenLineage intro & demo
○ Why do we need lineage?
○ Why have an open lineage standard?
○ Marquez and Flink demo
● Flink integration deep dive
○ Lineage for batch & streaming
○ Review of OpenLineage-Flink integration, FLIP-314
○ What does the future hold?
7. OpenLineage Mission
To define an open standard for the collection of lineage metadata from pipelines as they are running.
8. Data model
● Run: a particular instance of a streaming job, identified by a run UUID; its lifecycle is tracked via Run State Updates (transition, transition time)
● Job: a data pipeline that processes data, identified by name
● Dataset: Kafka topics, Iceberg tables, Object Storage destinations and so on, identified by name; attached to a run as inputs/outputs
● Each entity can carry extensible metadata: Run Facets, Job Facets, and Dataset Facets
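To make the model concrete, here is a minimal sketch of an OpenLineage run event as a plain Python dict. Field names follow the OpenLineage event shape (eventType, eventTime, run, job, inputs, outputs); the namespaces, job name, and topic names are illustrative assumptions, and facets are omitted for brevity.

```python
import uuid
from datetime import datetime, timezone

def run_event(event_type, job_name, run_id, inputs=(), outputs=()):
    """Build a minimal OpenLineage run event as a plain dict.
    Namespaces here are illustrative; facets are omitted."""
    dataset = lambda name: {"namespace": "kafka://demo-cluster", "name": name}
    return {
        "eventType": event_type,  # START, RUNNING, COMPLETE, FAIL, ...
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": "flink-jobs", "name": job_name},
        "inputs": [dataset(n) for n in inputs],
        "outputs": [dataset(n) for n in outputs],
    }

event = run_event("START", "orders-enrichment", str(uuid.uuid4()),
                  inputs=["orders"], outputs=["orders_enriched"])
```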
15. What is different for streaming jobs?
Batch and streaming differ in many aspects, but for lineage there are a few questions that matter:
● When does an unbounded job end?
● When and how do datasets get updated?
● Does the transformation change during execution?
16. When does the job end?
● It might seem that streaming jobs never end naturally
● Schema changes, new job versions, and new engine versions are natural points at which to start another run
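The idea of treating such points as run boundaries can be sketched as follows. The `rotate_run` helper and the event shape are simplified stand-ins of my own, not OpenLineage API; a real integration would emit full OpenLineage events.

```python
import uuid

def rotate_run(emit, job_name, current_run_id):
    """Treat a schema change or job upgrade as a run boundary: complete
    the current run and start a fresh one. `emit` is any event sink; the
    event shape is a simplified stand-in for real OpenLineage events."""
    emit({"eventType": "COMPLETE", "job": job_name, "runId": current_run_id})
    new_run_id = str(uuid.uuid4())
    emit({"eventType": "START", "job": job_name, "runId": new_run_id})
    return new_run_id

events = []
new_id = rotate_run(events.append, "clickstream", "run-1")
```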
17. When does a dataset get updated?
● Dataset versioning is quite important for bug analysis and data freshness
● Implicit: “last update timestamp”, Airflow’s data interval (the OpenLineage default)
● Explicit: Iceberg or Delta Lake dataset version
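The two versioning strategies above differ only in where the version comes from, as this small sketch shows; the `datasetVersion` key is illustrative of how the version would travel inside a dataset facet.

```python
from datetime import datetime, timezone

def implicit_version(last_update_ts):
    """OpenLineage default: derive the dataset version from the
    last-update timestamp."""
    return {"datasetVersion": last_update_ts.isoformat()}

def explicit_version(snapshot_id):
    """Table formats like Iceberg or Delta Lake expose an explicit
    snapshot/version id that can be used directly."""
    return {"datasetVersion": str(snapshot_id)}

v1 = implicit_version(datetime(2024, 3, 19, 12, 0, tzinfo=timezone.utc))
v2 = explicit_version(8451793125)
```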
18. When does a dataset get updated?
● In streaming, it’s not as obvious as in batch
● Updating on each row write would produce more metadata than actual data…
● Updating only on a potential job end would not give us any meaningful information in the meantime
19. When does a dataset get updated?
● Flink: maybe on checkpoint?
● Checkpointing is finicky: a 100 ms vs. a 10 minute checkpoint interval
● Configure a minimum event emission interval separately
● OpenLineage’s additive model fits that really well
● Spark: microbatch?
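The minimum-emission-interval idea amounts to a simple throttle in front of the event sink. A minimal sketch, assuming nothing about the real integration beyond the behavior described above:

```python
import time

class ThrottledEmitter:
    """Emit checkpoint-driven lineage events at most once per
    `min_interval` seconds, so a 100 ms checkpoint interval doesn't
    flood the lineage backend."""
    def __init__(self, emit, min_interval, clock=time.monotonic):
        self.emit = emit
        self.min_interval = min_interval
        self.clock = clock
        self._last = float("-inf")  # always emit the first event

    def on_checkpoint(self, event):
        now = self.clock()
        if now - self._last >= self.min_interval:
            self._last = now
            self.emit(event)
            return True
        return False  # too soon, drop this one

sent = []
t = {"now": 0.0}
em = ThrottledEmitter(sent.append, min_interval=60, clock=lambda: t["now"])
em.on_checkpoint("cp-1")          # emitted
t["now"] = 0.1; em.on_checkpoint("cp-2")   # suppressed
t["now"] = 61.0; em.on_checkpoint("cp-3")  # emitted again
```

Because OpenLineage events are additive, dropping intermediate updates this way loses no structural information, only granularity.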
20. Dynamic transformation modification
● KafkaSource can discover new topics during execution when passed a wildcard pattern
● We can catch this and emit an event containing this information when it happens
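The discovery step can be sketched as a set difference over the topics matching the subscription pattern; this mimics KafkaSource's periodic topic discovery, with function and topic names being illustrative assumptions.

```python
import re

def newly_matched_topics(pattern, known_topics, cluster_topics):
    """Return topics that match the wildcard subscription pattern but
    aren't tracked yet, so a lineage event with the updated input list
    can be emitted."""
    matched = {t for t in cluster_topics if re.fullmatch(pattern, t)}
    return sorted(matched - set(known_topics))

new = newly_matched_topics(r"orders\..*", {"orders.eu"},
                           {"orders.eu", "orders.us", "audit-log"})
```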
22. OpenLineage has a Flink integration!
● OpenLineage provides a Flink JobListener that notifies you when the job starts and ends, and on checkpoints at a configurable interval
● Support for Kafka, Iceberg, Cassandra, JDBC…
● Additional metadata: schemas, how much data was processed…
24. The integration has its limits
● Very limited; requires a few undesirable things like setting execution.attached
● No SQL or Table API support!
● Need to manually attach the JobListener to every job
● OpenLineage’s preferred solution would be to run the listener on the JobManager in a separate thread
25. And the internals are even more complex
● Basically, a lot of reflection
● The API wasn’t made for this use case: a lot of things are private, a lot of things are in class internals
● OpenLineage’s preferred solution would be an API for connectors to implement, where they would be responsible for providing correct data
26. And it even has evil hacks
● The list of transformations inside StreamExecutionEnvironment gets cleared a moment before the JobListeners are called
● Before that happens, we replace the clearable list with one that keeps a copy of the data on `clear`
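The hack described above can be sketched in a few lines (shown here in Python rather than the integration's actual Java): a list subclass that snapshots its contents before clearing, so the listener can still read the transformations afterwards.

```python
class CopyOnClearList(list):
    """Sketch of the hack: Flink clears the StreamExecutionEnvironment's
    transformation list just before JobListeners run, so the replacement
    list snapshots its contents on clear()."""
    def __init__(self, items=()):
        super().__init__(items)
        self.snapshot = []

    def clear(self):
        self.snapshot = list(self)  # keep a copy for the listener
        super().clear()

transformations = CopyOnClearList(["source", "map", "sink"])
transformations.clear()  # what Flink does before notifying listeners
```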
27. So, why bother?
● We’ve opportunistically created the integration despite the limitations, to gather interest and provide even that limited value
● The long-term solution would be a new API for Flink without any of those limitations:
○ A single API for both the DataStream and SQL APIs
○ Not dependent on any particular execution mode
○ Connectors responsible for their own lineage: testable and dependable!
○ No reflection :)
○ Possible to add column-level lineage support in the future
● And we’ve waited in that state for a bit
28. And then something happened
● FLIP-314 - Support Customized Flink Job Listener by Fang Yong, Zhanghao Chen
● New JobStatusChangedListener
○ JobCreatedEvent
○ JobExecutionStatusEvent
● JobCreatedEvent contains LineageGraph
● Both DataStream and
SQL/Table API support
● No attachment problem
● Sounds perfect?
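To illustrate why this removes the reflection problem, here is a Python-flavored stand-in for the FLIP-314 listener flow (the real interfaces are Java; class shapes here are simplified assumptions): the lineage graph arrives already attached to the job-created event.

```python
class JobCreatedEvent:
    """Stand-in for FLIP-314's JobCreatedEvent, which carries the
    connector-provided lineage graph at job creation time."""
    def __init__(self, job_name, lineage_graph):
        self.job_name = job_name
        self.lineage_graph = lineage_graph

class LineageListener:
    """Stand-in for a JobStatusChangedListener: no reflection needed,
    the lineage graph arrives with the event."""
    def __init__(self):
        self.captured = []

    def on_event(self, event):
        if isinstance(event, JobCreatedEvent):
            self.captured.append((event.job_name, event.lineage_graph))

listener = LineageListener()
listener.on_event(JobCreatedEvent(
    "orders-enrichment",
    {"sources": ["kafka: orders"], "sinks": ["iceberg: orders_enriched"]}))
```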
31. Problem with LineageVertex
● How do you know all connector implementations?
● How do you support custom connectors, for which the code is not known?
○ …reflection?
● How do you deal with breaking changes in connectors?
○ …even more reflection?
33. Finding a solution with the community
● Voice your concerns and propose how to resolve the issue
● Open a discussion on Jira, the Flink Slack, and the mailing list
● We managed to gain consensus and develop a solution that fits everyone involved
● Build a community around lineage
36. Facets allow extending the data
● Directly inspired by OpenLineage facets
● Allow you to attach any atomic piece of metadata to your dataset or vertex metadata
● Both built into Flink (like DatasetSchemaFacet) and external, or specific per connector
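Facets being "atomic" means each one can be attached or replaced independently, as this small sketch shows; the facet names and dataset shape are illustrative assumptions, not Flink's actual classes.

```python
def with_facet(metadata, facet_name, facet_body):
    """Facets are additive: attach one atomic piece of metadata
    without touching the rest of the facet map."""
    facets = dict(metadata.get("facets", {}))
    facets[facet_name] = facet_body
    return {**metadata, "facets": facets}

dataset = {"name": "orders_enriched", "facets": {}}
dataset = with_facet(dataset, "schema",
                     {"fields": [{"name": "id", "type": "BIGINT"}]})
dataset = with_facet(dataset, "datasetVersion", {"version": "42"})
```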
37. FLIP-314 will power OpenLineage
● Lineage driven by connectors is resilient
● Works for both DataStream and SQL/Table APIs
● Not dependent on any execution mode
39. Support for other streaming systems
● Spark Streaming
● Kafka Connect
● …
40. Column-level lineage support for Flink
● It’s a hard problem!
● But maybe not for SQL?
● UDFs definitely break simple solutions
41. Native support for Spark connectors
● In contrast to Flink, Spark already has extension mechanism that allows you to
view the internals of the job as it’s running - SparkListener
● We use LogicalPlan abstraction to extract metadata
● We have very similar issues as with Flink :)
● Internal vs external logical plan interfaces
● DataSourceV2 implementations
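The LogicalPlan-based extraction boils down to walking a tree and collecting leaf relations. A toy sketch (real plans are Spark's Catalyst trees, not dicts; node shapes here are my own):

```python
def collect_relations(plan):
    """Walk a toy logical-plan tree and collect leaf relations, roughly
    how the OpenLineage Spark integration derives input datasets from a
    LogicalPlan handed to it via SparkListener callbacks."""
    found = [plan["relation"]] if "relation" in plan else []
    for child in plan.get("children", []):
        found.extend(collect_relations(child))
    return found

plan = {"op": "Project", "children": [
    {"op": "Join", "children": [
        {"op": "Relation", "relation": "kafka: orders"},
        {"op": "Relation", "relation": "jdbc: customers"},
    ]}
]}
inputs = collect_relations(plan)
```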
42. Support for the “raw” Kafka client
● It’s very popular to use the raw client to build your own system, not only external systems
● bootstrap.servers is non-unique and ambiguous; use the Kafka cluster ID instead
● Execution is spread over multiple clients, but maybe not every one of them needs to always report
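The naming concern can be sketched as follows: identify the dataset by the cluster id (which Kafka assigns uniquely) rather than the bootstrap server list. The URI scheme here is illustrative, not the OpenLineage naming spec verbatim.

```python
def kafka_dataset_name(cluster_id, topic):
    """Identify Kafka datasets by cluster id rather than
    bootstrap.servers: many different server lists can point at the
    same cluster, while the cluster id is unique per cluster."""
    return f"kafka://{cluster_id}/{topic}"

# Two clients with different bootstrap.servers still agree on the name:
name = kafka_dataset_name("MkU3OEVBNTcwNTJENDM2Qg", "orders")
```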
43. OpenLineage is Open Source
● OpenLineage integrations are open source and open governance
within LF AI & Data
● The best way to fix a problem is to fix it yourself :)
● Second best way is to be active and raise awareness
○ Maybe other people are also interested?