OpenLineage For Stream Processing
Paweł Leszczyński (github pawel-big-lebowski)
Maciej Obuchowski (github mobuchowski)
Kafka Summit 2024
Agenda
● OpenLineage intro & demo
○ Why do we need lineage?
○ Why have open lineage?
○ Marquez and Flink demo
● Flink integration deep dive
○ Lineage for batch & streaming
○ Review of OpenLineage-Flink integration, FLIP-314
○ What does the future hold?
OpenLineage
1
Autumn Rhythm - Jackson Pollock
https://www.flickr.com/photos/thoth188/276162883
https://flic.kr/p/hjxW62
To define an open standard
for the collection of lineage
metadata from pipelines
as they are running.
OpenLineage
Mission
Data model
A Run is a particular instance of a streaming job
A Job is a data pipeline that processes data
Datasets are Kafka topics, Iceberg tables, object storage destinations and so on
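[Diagram: a Run State Update (transition, transition time) carries a run (Run, identified by run uuid, with Run Facets), a job (Job, identified by a name-based job id, with Job Facets), and inputs / outputs (Datasets, identified by name-based dataset ids, with Dataset Facets)]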
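To make the data model more concrete, here is a minimal sketch of a run state update assembled as a plain Python dict. The field names follow the OpenLineage run event as described above (run, job, inputs, outputs, facets). The job name, dataset names, and producer URI are placeholders invented for illustration, and the required `schemaURL` field is left out for brevity.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, run_id, job_namespace, job_name, inputs, outputs):
    """Assemble a simplified OpenLineage-style run event as a plain dict."""
    return {
        "eventType": event_type,  # the run state transition, e.g. START or COMPLETE
        "eventTime": datetime.now(timezone.utc).isoformat(),  # transition time
        "producer": "https://example.com/hypothetical-producer",  # placeholder URI
        "run": {"runId": run_id, "facets": {}},
        "job": {"namespace": job_namespace, "name": job_name, "facets": {}},
        # Datasets are identified by name: a namespace plus a name within it.
        "inputs": [{"namespace": ns, "name": name, "facets": {}} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name, "facets": {}} for ns, name in outputs],
    }


run_id = str(uuid.uuid4())
event = build_run_event(
    "START",
    run_id,
    "flink_jobs",          # hypothetical job namespace
    "kafka_to_postgres",   # hypothetical job name
    inputs=[("kafka://broker:9092", "input_topic")],
    outputs=[("postgres://db:5432", "demo.public.sink_table")],
)
print(json.dumps(event, indent=2))
```

Later events for the same run reuse the same `runId`, so a consumer can merge them into one picture of the run.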
Producers Consumers
Marquez and Flink
Integration Demo
2
Demo
● Available under
https://github.com/OpenLineage/workshops/tree/main/flink-streaming
● Contains
○ Two Flink jobs
■ Kafka to Postgres
■ Postgres to Kafka
○ Airflow to run some Postgres queries
○ Marquez to present lineage graph
Flink applications - read & write to Kafka
Airflow DAG
OpenLineage for
Streaming
3
What is different for Streaming jobs?
Batch and streaming differ in many aspects, but for lineage there are a few questions that matter:
● When does the unbounded job end?
● When and how do datasets get updated?
● Does the transformation change during execution?
When does the job end?
● It might seem that streaming jobs never end naturally
● Schema changes, new job versions, new engine versions - these are points at which it is worth starting another run
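One way to picture the "start another run" rule is a small tracker that keeps a run id until the schema, job version, or engine version changes. This is only an illustrative sketch: `JobFingerprint` and `RunTracker` are hypothetical names, not part of OpenLineage or Flink.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class JobFingerprint:
    """The properties whose change is treated as the start of a new run."""
    schema_version: str
    job_version: str
    engine_version: str


class RunTracker:
    """Keeps one run id until something that defines the run changes."""

    def __init__(self):
        self._fingerprint = None
        self.run_id = None

    def observe(self, fingerprint):
        """Record the current fingerprint; return True if a new run was started."""
        if fingerprint != self._fingerprint:
            self._fingerprint = fingerprint
            self.run_id = str(uuid.uuid4())
            return True
        return False


tracker = RunTracker()
started_first = tracker.observe(JobFingerprint("v1", "1.0.0", "flink-1.18"))
first_run = tracker.run_id
# Same schema, job, and engine: the existing run continues.
started_again = tracker.observe(JobFingerprint("v1", "1.0.0", "flink-1.18"))
# A new job version is deployed: a new run begins.
started_after_upgrade = tracker.observe(JobFingerprint("v1", "1.1.0", "flink-1.18"))
```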
When does a dataset get updated?
● Dataset versioning is important for bug analysis and data freshness
● Implicit - “last update timestamp”, Airflow’s data interval - the OpenLineage default
● Explicit - Iceberg, Delta Lake dataset version
When does a dataset get updated?
● In streaming, it is not as obvious as in batch
● An update on each row write would produce more metadata than actual data…
● An update only on a potential job end would not give us any meaningful information in the meantime
When does a dataset get updated?
● Flink: maybe on checkpoint?
● Checkpointing is finicky: the checkpoint interval can be 100 ms or 10 minutes
● Configure the minimum event emission interval separately
● OpenLineage’s additive model fits that really well
● Spark: microbatch?
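The idea of decoupling event emission from the checkpoint interval can be sketched as a simple throttle. The class name and the millisecond-based interface are assumptions made for this example. It does not reflect how the OpenLineage Flink integration is implemented.

```python
class CheckpointEventThrottle:
    """Emit on a checkpoint only if enough time has passed since the last emission."""

    def __init__(self, min_interval_ms):
        self.min_interval_ms = min_interval_ms
        self._last_emitted_at = None

    def should_emit(self, checkpoint_time_ms):
        """Return True if a lineage event should be emitted for this checkpoint."""
        if (self._last_emitted_at is None
                or checkpoint_time_ms - self._last_emitted_at >= self.min_interval_ms):
            self._last_emitted_at = checkpoint_time_ms
            return True
        return False


throttle = CheckpointEventThrottle(min_interval_ms=60_000)
# Checkpoints every 100 ms for ten minutes (timestamps in milliseconds).
checkpoint_times = list(range(0, 600_000, 100))
emitted = [t for t in checkpoint_times if throttle.should_emit(t)]
print(len(checkpoint_times), "checkpoints,", len(emitted), "events emitted")
```

With a 100 ms checkpoint interval the job checkpoints 6000 times in ten minutes, but only one event per minute is emitted. With a 10 minute checkpoint interval the throttle never holds anything back.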
Dynamic transformation modification
● KafkaSource can discover new topics during execution when it is given a wildcard pattern
● We can catch this and emit an event containing that information when it happens
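The "catch this and emit an event" step can be sketched as a watcher that compares the topics matching the pattern against those it has already seen. `TopicPatternWatcher` is a hypothetical helper written for this example. The real detection happens inside the Flink Kafka connector.

```python
import re


class TopicPatternWatcher:
    """Tracks which topics matching a pattern have already been reported."""

    def __init__(self, pattern):
        self._pattern = re.compile(pattern)
        self._known = set()

    def refresh(self, cluster_topics):
        """Return the topics that match the pattern and have not been seen before."""
        matched = {t for t in cluster_topics if self._pattern.fullmatch(t)}
        new_topics = sorted(matched - self._known)
        self._known |= matched
        return new_topics


watcher = TopicPatternWatcher(r"orders\..*")
first = watcher.refresh(["orders.eu", "payments"])
second = watcher.refresh(["orders.eu", "orders.us", "payments"])
third = watcher.refresh(["orders.eu", "orders.us", "payments"])

for new_topics in (first, second, third):
    if new_topics:
        # In a real integration this would become a lineage event listing the new inputs.
        print("emit event with new input datasets:", new_topics)
```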
OpenLineage Flink
Integration
update
4
OpenLineage has Flink integration!
● OpenLineage provides a Flink JobListener that notifies you on job start and end
● Support for Kafka, Iceberg, Cassandra, JDBC…
● Emits events when the job starts, when it ends, and on checkpoints at a particular interval
● Additional metadata: schemas, amount of data processed…
Idea is simple, execution is more complex
The integration has its limits
● Very limited; requires a few undesirable things, such as setting execution.attached
● No SQL or Table API support!
● You need to manually attach the JobListener to every job
● OpenLineage's preferred solution would be to run the listener on the JobManager in a separate thread
And the internals are even more complex
● Basically, a lot of reflection
● The API wasn’t made for this use case: many things are private, and many are hidden in class internals
● OpenLineage's preferred solution would be an API for connectors to implement, making them responsible for providing correct data
And even has evil hacks
● The list of transformations inside StreamExecutionEnvironment is cleared a moment before the JobListeners are called
● Before that happens, we replace the clearable list with one that keeps a copy of its data on `clear`
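The trick can be shown in a few lines of Python. The actual integration does this in Java, using reflection on Flink internals. `FakeEnvironment` below is a hypothetical stand-in that only mimics the "clear, then call listeners" ordering, and `SnapshotOnClearList` is a name made up for this sketch.

```python
class SnapshotOnClearList(list):
    """A list that keeps a copy of its contents whenever it is cleared."""

    def __init__(self, *args):
        super().__init__(*args)
        self.snapshot = []

    def clear(self):
        self.snapshot = list(self)  # preserve the data before it disappears
        super().clear()


class FakeEnvironment:
    """Hypothetical stand-in for an environment that clears state before notifying."""

    def __init__(self):
        self.transformations = []

    def execute(self, listeners):
        self.transformations.clear()  # cleared a moment before listeners run
        for listener in listeners:
            listener(self)


env = FakeEnvironment()
# Swap in the snapshotting list before the job's transformations are registered.
env.transformations = SnapshotOnClearList(env.transformations)
env.transformations.extend(["kafka-source", "map", "jdbc-sink"])

seen = []
env.execute([lambda e: seen.append(list(e.transformations.snapshot))])
print(seen)
```

The listener still sees the transformations through the snapshot, even though the live list is already empty when it runs.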
So, why bother?
● We opportunistically created the integration despite its limitations, to gather interest and provide even that limited value
● The long-term solution would be a new API for Flink without any of those limitations
○ A single API for both the DataStream and SQL APIs
○ Not depending on any particular execution mode
○ Connectors responsible for their own lineage - testable and dependable!
○ No reflection :)
○ Possible to have Column-Level Lineage support in the future
● And we’ve waited in that state for a bit
And then something happened
● FLIP-314 - Support Customized Flink Job Listener by Fang Yong, Zhanghao Chen
● New JobStatusChangedListener
○ JobCreatedEvent
○ JobExecutionStatusEvent
● JobCreatedEvent contains LineageGraph
● Both DataStream and
SQL/Table API support
● No attachment problem
● Sounds perfect?
LineageGraph
Problem with LineageVertex
● How do you know all possible connector implementations?
● How do you support custom connectors, for which the code is not known?
○ …reflection?
● How do you deal with breaking changes in connectors?
○ …even more reflection?
Find a solution with the community
● Voice your concern, propose how to resolve the issue
● Open a discussion on Jira, Flink Slack, and the mailing list
● We managed to reach consensus and develop a solution that fits everyone involved
● Build a community around lineage
Resulting API is really nice
Facets Allow You to Extend Data
● Directly inspired by OpenLineage facets
● Allow you to attach any atomic piece of metadata to your dataset or vertex metadata
● Both built into Flink - like DatasetSchemaFacet - and external, or specific to a connector
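The facet idea, independent pieces of metadata attached to a dataset under their own names, can be sketched with a small Python class. This is not Flink's API: the `Dataset` class, `with_facet`, and the `hypotheticalConnectorFacet` key are invented for illustration, and only the `schema` facet shape (a list of named, typed fields) mirrors the OpenLineage schema facet.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A dataset identified by name, carrying an open-ended set of facets."""
    namespace: str
    name: str
    facets: dict = field(default_factory=dict)

    def with_facet(self, facet_name, payload):
        """Attach one atomic piece of metadata under its own key."""
        self.facets[facet_name] = payload
        return self


def read_schema_fields(dataset):
    """Consumers read only the facets they understand and ignore the rest."""
    schema = dataset.facets.get("schema", {})
    return [f["name"] for f in schema.get("fields", [])]


topic = (
    Dataset("kafka://broker:9092", "orders")
    # A facet any consumer might understand.
    .with_facet("schema", {"fields": [{"name": "order_id", "type": "string"}]})
    # A connector-specific facet; consumers that do not know it can skip it.
    .with_facet("hypotheticalConnectorFacet", {"consumerGroup": "orders-reader"})
)
print(read_schema_fields(topic))
```

Because each facet lives under its own key, a connector can add new metadata without breaking consumers that were written before that facet existed.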
FLIP-314 will power OpenLineage
● Lineage driven by connectors is resilient
● Works for both DataStream and SQL/Table APIs
● Not dependent on any execution mode
What does the
future hold?
5
Support for other streaming systems
● Spark Streaming
● Kafka Connect
● …
Column-level lineage support for Flink
● It’s a hard problem!
● But maybe not for SQL?
● UDFs definitely break simple solutions
Native support for Spark connectors
● In contrast to Flink, Spark already has an extension mechanism that allows you to view the internals of the job as it’s running - SparkListener
● We use the LogicalPlan abstraction to extract metadata
● We run into issues very similar to those with Flink :)
● Internal vs external logical plan interfaces
● DataSourceV2 implementations
Support for “raw” Kafka client
● Using the raw client to build your own system, rather than relying only on external systems, is very popular
● bootstrap.servers is non-unique and ambiguous - use the Kafka cluster ID instead
● Execution is spread over multiple clients - but maybe not every one of them needs to report every time
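The ambiguity of bootstrap.servers can be illustrated with a sketch in which two different broker lists point at the same cluster. The `kafka://<cluster id>` namespace format, the cluster ids, and the lookup table are all assumptions made for this example. In a real client the id would come from the cluster itself, for instance through an admin client's describe-cluster call.

```python
# Hypothetical mapping from broker address to the cluster it belongs to.
FAKE_CLUSTER_IDS = {
    "broker-1:9092": "cluster-abc",
    "broker-2:9092": "cluster-abc",
    "other:9092": "cluster-xyz",
}


def lookup_cluster_id(bootstrap_servers):
    """Stand-in for asking the cluster for its id via the first listed broker."""
    first_broker = bootstrap_servers.split(",")[0].strip()
    return FAKE_CLUSTER_IDS[first_broker]


def dataset_namespace(bootstrap_servers, cluster_id_lookup):
    """Build a dataset namespace from the cluster id instead of the broker list."""
    return f"kafka://{cluster_id_lookup(bootstrap_servers)}"


# Two clients configured with different bootstrap.servers for the same cluster.
ns_a = dataset_namespace("broker-1:9092,broker-2:9092", lookup_cluster_id)
ns_b = dataset_namespace("broker-2:9092", lookup_cluster_id)
# A client of a different cluster.
ns_c = dataset_namespace("other:9092", lookup_cluster_id)
print(ns_a, ns_b, ns_c)
```

Naming datasets by cluster id means the same topic reported by different clients resolves to one dataset in the lineage graph rather than several.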
OpenLineage is Open Source
● OpenLineage integrations are open source under open governance within LF AI & Data
● The best way to fix a problem is to fix it yourself :)
● The second best way is to be active and raise awareness
○ Maybe other people are also interested?
Thanks :)
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 

OpenLineage for Stream Processing | Kafka Summit London

  • 1. OpenLineage For Stream Processing Paweł Leszczyński (github pawel-big-lebowski) Maciej Obuchowski (github mobuchowski) Kafka Summit 2024
  • 2. Agenda ● OpenLineage intro & demo ○ Why do we need lineage? ○ Why have open lineage? ○ Marquez and Flink demo ● Flink integration deep dive ○ Lineage for batch & streaming ○ Review of OpenLineage-Flink integration, FLIP-314 ○ What does the future hold?
  • 4. Autumn Rhythm - Jackson Pollock https://www.flickr.com/photos/thoth188/276162883
  • 7. OpenLineage Mission: To define an open standard for the collection of lineage metadata from pipelines as they are running.
  • 8. Data model ● A Run is a particular instance of a streaming job ● A Job is a data pipeline that processes data ● Datasets are Kafka topics, Iceberg tables, object storage destinations and so on ● Diagram: a Run State Update (transition, transition time) references a Run (run UUID), a Job (name-based job id) and its input / output Datasets (name-based dataset id); Run, Job and Dataset each carry facets (Run Facet, Job Facet, Dataset Facet)
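The data model on this slide can be sketched as a plain JSON document. The snippet below is a minimal illustration, not a complete or validated OpenLineage event: the field names follow the OpenLineage run event layout (`eventType`, `eventTime`, `run`, `job`, `inputs`, `outputs`, `producer`), while the helper `make_run_event`, the producer URL, and the namespaces and names in the usage note are assumptions made up for this example. The `schemaURL` field that real events carry is left out.

```python
import json
from datetime import datetime, timezone


def make_run_event(event_type, run_id, job_namespace, job_name, inputs, outputs,
                   event_time=None):
    """Build a dict shaped like an OpenLineage run event (illustrative only)."""
    if event_time is None:
        event_time = datetime.now(timezone.utc).isoformat()
    return {
        "eventType": event_type,   # the run state transition, e.g. START / RUNNING / COMPLETE
        "eventTime": event_time,   # the transition time
        "run": {"runId": run_id, "facets": {}},
        "job": {"namespace": job_namespace, "name": job_name, "facets": {}},
        # Datasets are identified by name (namespace + name), not by a generated id.
        "inputs": [{"namespace": ns, "name": name, "facets": {}} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name, "facets": {}} for ns, name in outputs],
        "producer": "https://example.com/hypothetical-producer",  # assumption
    }


def to_json(event):
    """Serialize the event the way it would be sent to a lineage backend."""
    return json.dumps(event, indent=2)
```

For example, `make_run_event("RUNNING", run_id, "flink", "kafka_to_postgres", [("kafka://broker:9092", "orders")], [("postgres://db:5432", "public.orders")])` describes one state update of one run of one job, with one input and one output dataset.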
  • 11. Demo ● Available under https://github.com/OpenLineage/workshops/tree/main/flink-streaming ● Contains ○ Two Flink jobs ■ Kafka to Postgres ■ Postgres to Kafka ○ Airflow to run some Postgres queries ○ Marquez to present lineage graph
  • 12. Flink applications - read & write to Kafka
  • 15. What is different for streaming jobs? Batch and streaming differ in many aspects, but for lineage there are a few questions that matter: ● When does the unbounded job end? ● When and how do datasets get updated? ● Does the transformation change during execution?
  • 16. When does a job end? ● It might seem that streaming jobs never end naturally ● Schema changes, new job versions, new engine versions - these are points where it’s worth starting another run
  • 17. When does a dataset get updated? ● Dataset versioning is pretty important - bug analysis, data freshness ● Implicit - “last update timestamp”, Airflow’s data interval - OL default ● Explicit - Iceberg, Delta Lake dataset version
  • 18. When does a dataset get updated? ● In streaming, it’s not as obvious as in batch ● An update on each row write would produce more metadata than actual data… ● An update only on a potential job end would not give us any meaningful information in the meantime
  • 19. When does a dataset get updated? ● Flink: maybe on checkpoint? ● Checkpointing is finicky: a 100 ms vs a 10 minute checkpoint interval ● Configure the minimum event emission interval separately ● OpenLineage’s additive model fits that really well ● Spark: microbatch?
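The idea of emitting on checkpoints, but no more often than a separately configured minimum interval, can be sketched as follows. This is an illustrative stand-in, not the OpenLineage Flink integration's code: the class `CheckpointLineageEmitter`, its parameters, and the `checkpointId` key in the emitted dict are all hypothetical names chosen for this example.

```python
import time


class CheckpointLineageEmitter:
    """Emit a lineage event on checkpoint, at most once per minimum interval."""

    def __init__(self, emit, min_interval_seconds, clock=time.monotonic):
        self._emit = emit                      # callback that sends the event somewhere
        self._min_interval = min_interval_seconds
        self._clock = clock                    # injectable so the logic can be tested
        self._last_emit = None

    def on_checkpoint(self, checkpoint_id):
        """Return True if an event was emitted for this checkpoint."""
        now = self._clock()
        too_soon = (
            self._last_emit is not None
            and now - self._last_emit < self._min_interval
        )
        if too_soon:
            # Checkpoints may fire every 100 ms; skip until the interval has passed.
            return False
        self._last_emit = now
        self._emit({"eventType": "RUNNING", "checkpointId": checkpoint_id})
        return True
```

Because each emitted event only adds information to the same run, skipping intermediate checkpoints loses granularity but not correctness, which is why the additive model suits this scheme.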
  • 20. Dynamic transformation modification ● KafkaSource can find a new topic during execution when passed a wildcard pattern ● We can catch this and emit an event containing this information when it happens
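A minimal sketch of reacting to newly discovered topics, assuming the source reports the topic list it currently sees. `TopicPatternTracker` and `on_topics_discovered` are hypothetical names for this example and are not part of Flink or OpenLineage; the emitted dict is only shaped loosely like a run event.

```python
import re


class TopicPatternTracker:
    """Track topics matching a wildcard pattern and emit when new ones appear."""

    def __init__(self, pattern, emit):
        self._pattern = re.compile(pattern)
        self._known = set()
        self._emit = emit  # callback that sends the event somewhere

    def on_topics_discovered(self, topics):
        """Return the newly matched topics, emitting one event if there are any."""
        matching = {t for t in topics if self._pattern.fullmatch(t)}
        new_topics = sorted(matching - self._known)
        if new_topics:
            self._known |= matching
            # Report the full current input set so consumers see the updated job shape.
            self._emit({"eventType": "RUNNING", "inputs": sorted(self._known)})
        return new_topics
```

Nothing is emitted when the discovered set is unchanged, so periodic discovery does not flood the lineage backend.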
  • 22. OpenLineage has Flink integration! ● OpenLineage has a Flink JobListener that notifies you on job start and end ● Support for Kafka, Iceberg, Cassandra, JDBC… ● Notifies you when a job starts, ends, and on checkpoint at a particular interval ● Additional metadata: schemas, how much data was processed…
  • 23. Idea is simple, execution is more complex
  • 24. The integration has its limits ● Very limited, requires a few undesirable things like setting execution.attached ● No SQL or Table API support! ● Need to manually attach the JobListener to every job ● OpenLineage’s preferred solution would be to run the listener on the JobManager in a separate thread
  • 25. And the internals are even more complex ● Basically, a lot of reflection ● The API wasn’t made for this use case: a lot of things are private, a lot of things are in the class internals ● OpenLineage’s preferred solution would be to have an API for connectors to implement, where they would be responsible for providing correct data
  • 26. And it even has evil hacks ● The list of transformations inside StreamExecutionEnvironment gets cleared a moment before JobListeners are called ● Before that happens, we replace the clearable list with one that keeps a copy of the data on `clear`.
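The trick of swapping in a list that keeps a copy of its contents when cleared can be illustrated with a small Python analogy. This is not the integration's code (which is Java and relies on reflection to do the swap); `SnapshotOnClearList` and its `snapshot` attribute are names invented for this sketch.

```python
class SnapshotOnClearList(list):
    """A list that remembers its contents at the moment clear() is called."""

    def __init__(self, iterable=()):
        super().__init__(iterable)
        self.snapshot = []

    def clear(self):
        # Keep a copy first, so a listener invoked after the clear can still read it.
        self.snapshot = list(self)
        super().clear()


def listener_view(transformations):
    """What a listener called after clear() can still recover."""
    return transformations.snapshot
```

The owner of the list still sees it emptied as expected, while a listener that runs afterwards can read the saved copy.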
  • 27. So, why bother? ● We’ve opportunistically created the integration despite its limitations, to gather interest and provide even that limited value ● The long-term solution would be a new API for Flink that would not have any of those limitations ○ A single API for both DataStream and SQL APIs ○ Not depending on any particular execution mode ○ Connectors responsible for their own lineage - testable and dependable! ○ No reflection :) ○ Possible to have Column-Level Lineage support in the future ● And we’ve waited in that state for a bit
  • 28. And then something happened ● FLIP-314 - Support Customized Flink Job Listener by Fang Yong, Zhanghao Chen ● New JobStatusChangedListener ○ JobCreatedEvent ○ JobExecutionStatusEvent ● JobCreatedEvent contains LineageGraph ● Both DataStream and SQL/Table API support ● No attachment problem ● Sounds perfect?
  • 30. Problem with LineageVertex ● How do you know all possible connector implementations?
  • 31. Problem with LineageVertex ● How do you know all connector implementations? ● How do you support custom connectors, where we can’t get the source? ○ …reflection?
  • 32. Problem with LineageVertex ● How do you know all connector implementations? ● How do you support custom connectors, for which the code is not known? ● How do you deal with breaking changes in connectors? ○ …even more reflection?
  • 33. Find a solution with the community ● Voice your concern, propose how to resolve the issue ● Open discussion on Jira, Flink Slack, mailing list ● Managed to gain consensus and develop a solution that fits everyone involved ● Build a community around lineage
  • 34. Resulting API is really nice
  • 35. Resulting API is really nice
  • 36. Facets Allow You to Extend Data ● Directly inspired by OpenLineage facets ● Allow you to attach any atomic piece of metadata to your dataset or vertex metadata ● Both built into Flink - like DatasetSchemaFacet - and external, or specific per connector
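The facet idea, where each atomic piece of metadata is attached under its own key, can be sketched in a few lines. This is a Python analogy under stated assumptions, not the Flink or OpenLineage API: the `Dataset` class, `with_facet`, and the `kafkaConsumerGroup` facet are hypothetical, and only the `schema` facet shape (a list of fields with `name` and `type`) follows the OpenLineage schema facet.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A dataset identified by namespace and name, extensible through facets."""

    namespace: str
    name: str
    facets: dict = field(default_factory=dict)

    def with_facet(self, key, facet):
        """Attach one atomic piece of metadata under its own key."""
        self.facets[key] = facet
        return self


def describe_orders_topic():
    """Build an example dataset carrying a standard and a connector-specific facet."""
    return (
        Dataset("kafka://broker:9092", "orders")
        .with_facet("schema", {"fields": [{"name": "id", "type": "long"},
                                          {"name": "amount", "type": "double"}]})
        .with_facet("kafkaConsumerGroup", {"groupId": "orders-reader"})  # hypothetical facet
    )
```

A consumer that does not understand `kafkaConsumerGroup` can ignore that key and still read the schema, which is what makes facets a safe extension point.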
  • 37. FLIP-314 will power OpenLineage ● Lineage driven by connectors is resilient ● Works for both DataStream and SQL/Table APIs ● Not dependent on any execution mode
  • 39. Support for other streaming systems ● Spark Streaming ● Kafka Connect ● …
  • 40. Column-level lineage support for Flink ● It’s a hard problem! ● But maybe not for SQL? ● UDFs definitely break simple solutions
  • 41. Native support for Spark connectors ● In contrast to Flink, Spark already has an extension mechanism that allows you to view the internals of the job as it’s running - SparkListener ● We use the LogicalPlan abstraction to extract metadata ● We have issues very similar to those with Flink :) ● Internal vs external logical plan interfaces ● DataSourceV2 implementations
  • 42. Support for “raw” Kafka client ● It’s very popular to use the raw client to build your own system, not only external systems ● bootstrap.servers is non-unique and ambiguous - use the Kafka cluster ID ● Execution is spread over multiple clients - but maybe not every one of them needs to always report
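Why the cluster ID makes a better dataset identifier than `bootstrap.servers` can be shown with a short sketch. Everything here is an assumption for illustration: the function name, the `kafka://<cluster id>` naming scheme, and the `resolve_cluster_id` callback, which stands in for asking the cluster itself for its ID (for example through a Kafka admin client).

```python
def kafka_dataset_namespace(bootstrap_servers, resolve_cluster_id):
    """Derive a dataset namespace from the cluster ID rather than the broker list."""
    cluster_id = resolve_cluster_id(bootstrap_servers)
    return f"kafka://{cluster_id}"  # hypothetical naming scheme


def make_static_resolver(cluster_by_servers):
    """Test double: look up the cluster ID from a fixed mapping."""
    def resolve(bootstrap_servers):
        return cluster_by_servers[bootstrap_servers]
    return resolve
```

Two clients configured with different broker lists for the same cluster would then report the same namespace, so their topics resolve to the same datasets in the lineage graph.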
  • 43. OpenLineage is Open Source ● OpenLineage integrations are open source and open governance within LF AI & Data ● The best way to fix a problem is to fix it yourself :) ● Second best way is to be active and raise awareness ○ Maybe other people are also interested?