SlideShare a Scribd company logo
1 of 19
Download to read offline
OpenLineage &
Airflow - data lineage
has never been easier
May 2022
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Maciej Paweł
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
OpenLineage to build a healthy data ecosystem
Team A
Team C
Team B Interesting questions:
● What is the data
source?
● What is the schema?
● Who is the owner?
● How often is it
updated?
● Where does it come
from?
● Who is using it?
● What has changed?
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Infer or observe?
…or you can capture it
when the image is
originally created!
You can try to infer the
date and location of an
image after the fact…
rocks
26m until
sunset
haze
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
OpenLineage mission
● To define an open standard for the collection of lineage metadata from pipelines as they are running.
Graph DB
Backend
Producers
OpenLineage
Kafka topic
HTTP
client
Consumers
Kafka
client
GraphDB
client
Kafka
client
Kafka topic
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
● We want to achieve real-time notification about TaskInstance start, success, fail
● First way to do it? Subclassing DAG
- from airflow import DAG
+ from openlineage.airflow import DAG
● We can overload DAG methods and get notifications this way.
● Modify all the dags, have to set up openlineage-airflow locally.
The dark past of Airflow Integration
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
● We want to achieve real-time notification about TaskInstance start, success, fail
● First way to do it? Subclassing DAG
- from airflow import DAG
+ from openlineage.airflow import DAG
● We can overload DAG methods and get notifications this way.
● Modify all the dags, have to set up openlineage-airflow locally.
● Stopped working in Airflow 2
The dark past of Airflow Integration
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
● LineageBackend - sounds like right tool for the job?
● Both Airflow 1.10 and Airflow 2.1+ supported
● You can choose your LineageBackend in Airflow config
● Does not allow us to emit events on task start or failure
● We need those to reliably report what happened!
● Let’s contribute!
The closer, slightly dark past
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
● LineageBackend - sounds like right tool for the job?
● Both Airflow 1.10 and Airflow 2.1+ supported
● You can choose your LineageBackend in Airflow config
● Does not allow us to emit events on task start or failure
● We need those to reliably report what happened!
● Let’s contribute!
● Turns out it’s not so simple.
The closer, slightly dark past
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
● Let’s add new interface!
● We want our plugin to be notified when TaskInstanceState changes to RUNNING, SUCCESS, FAILED
The great present
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
● SqlAlchemy allows us to listen to existing database events
● AirflowPlugin mechanism allows us to automatically load plugin code from external Python packages
● Pluggy allows us to call registered plugins without needing to know what they are
The great present
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
● Okay
● Is present in Airflow 2.3!
The great present
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
● Extractors
● Built-in extractors -
○ BigQueryExtractor
○ SnowflakeExtractor
○ PostgresExtractor
○ GreatExpectationsExtractor
○ …
● Possibility to create custom extractors
Features
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
● Additional common library to help with writing extractors
● SQL parser
● Other integrations (dbt…) can use those features as well
Features
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
● Prevalence of PythonOperator
● Can we get data directly from Hooks?
● Hooks are very diverse.
The shiny future
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
● Prevalence of PythonOperator
● Can we get data directly from Hooks?
● Hooks are very diverse.
● AIP-48 solves a lot of those problems
The shiny future
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Top 3 recent features:
● Support for Spark 3.2.1
● Extensibility API
○ Possibility to write custom plugins to enrich existing OpenLineage events.
● Column level lineage
○ Which input columns were used to produce output column X?
Other:
● Spawning Spark from your Airflow DAG? We’ll keep track of that.
● Lifecycle state change - understand the meaning of DROP, DELETE, ALTER…
● Dataset versions for Iceberg & Delta
Apache Spark Integration
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Status:
● Under construction.
● We’re already able to:
○ Identify sources & sinks Kafka topics,
○ Fetch datasets’ schemas for Avro,
○ Include checkpoint statistics in OpenLineage events,
○ Retrieve information on Iceberg sources & sinks,
○ …
● Looking forward to publish first experimental version.
Apache Flink integration
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Status:
● github.com/OpenLineage/OpenLineage
You can contribute too!

More Related Content

More from GetInData

Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...GetInData
 
Feast + Amundsen Integration - Mariusz Strzelecki, GetInData
Feast + Amundsen Integration - Mariusz Strzelecki, GetInDataFeast + Amundsen Integration - Mariusz Strzelecki, GetInData
Feast + Amundsen Integration - Mariusz Strzelecki, GetInDataGetInData
 
Kubernetes and real-time analytics - how to connect these two worlds with Apa...
Kubernetes and real-time analytics - how to connect these two worlds with Apa...Kubernetes and real-time analytics - how to connect these two worlds with Apa...
Kubernetes and real-time analytics - how to connect these two worlds with Apa...GetInData
 
Big data trends - Krzysztof Zarzycki, GetInData
Big data trends - Krzysztof Zarzycki, GetInDataBig data trends - Krzysztof Zarzycki, GetInData
Big data trends - Krzysztof Zarzycki, GetInDataGetInData
 
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...GetInData
 
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...GetInData
 
Monitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInDataMonitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInDataGetInData
 
Complex event processing platform handling millions of users - Krzysztof Zarz...
Complex event processing platform handling millions of users - Krzysztof Zarz...Complex event processing platform handling millions of users - Krzysztof Zarz...
Complex event processing platform handling millions of users - Krzysztof Zarz...GetInData
 
Predicting Startup Market Trends based on the news and social media - Albert ...
Predicting Startup Market Trends based on the news and social media - Albert ...Predicting Startup Market Trends based on the news and social media - Albert ...
Predicting Startup Market Trends based on the news and social media - Albert ...GetInData
 
Managing Big Data projects in a constantly changing environment - Rafał Zalew...
Managing Big Data projects in a constantly changing environment - Rafał Zalew...Managing Big Data projects in a constantly changing environment - Rafał Zalew...
Managing Big Data projects in a constantly changing environment - Rafał Zalew...GetInData
 
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...GetInData
 
Strategies for on premise to Google Cloud migration - Mateusz Pytel, GetInData
Strategies for on premise to Google Cloud migration - Mateusz Pytel, GetInDataStrategies for on premise to Google Cloud migration - Mateusz Pytel, GetInData
Strategies for on premise to Google Cloud migration - Mateusz Pytel, GetInDataGetInData
 
Monitoring environment based on satellite data with Python and PySpark - Albe...
Monitoring environment based on satellite data with Python and PySpark - Albe...Monitoring environment based on satellite data with Python and PySpark - Albe...
Monitoring environment based on satellite data with Python and PySpark - Albe...GetInData
 
Welcome to MLOps candy shop and choose your flavour! - Mateusz Pytel & Marius...
Welcome to MLOps candy shop and choose your flavour! - Mateusz Pytel & Marius...Welcome to MLOps candy shop and choose your flavour! - Mateusz Pytel & Marius...
Welcome to MLOps candy shop and choose your flavour! - Mateusz Pytel & Marius...GetInData
 
Real time analytics that controls 50% of mobile network in Poland - Maciej Br...
Real time analytics that controls 50% of mobile network in Poland - Maciej Br...Real time analytics that controls 50% of mobile network in Poland - Maciej Br...
Real time analytics that controls 50% of mobile network in Poland - Maciej Br...GetInData
 
How to maximize profit from IoT by using data platform - Albert Lewandowski, ...
How to maximize profit from IoT by using data platform - Albert Lewandowski, ...How to maximize profit from IoT by using data platform - Albert Lewandowski, ...
How to maximize profit from IoT by using data platform - Albert Lewandowski, ...GetInData
 
Understanding Big Data Analytics - solutions for growing businesses - Rafał M...
Understanding Big Data Analytics - solutions for growing businesses - Rafał M...Understanding Big Data Analytics - solutions for growing businesses - Rafał M...
Understanding Big Data Analytics - solutions for growing businesses - Rafał M...GetInData
 
Hot to build continuously processing for 24/7 real-time data streaming platform?
Hot to build continuously processing for 24/7 real-time data streaming platform?Hot to build continuously processing for 24/7 real-time data streaming platform?
Hot to build continuously processing for 24/7 real-time data streaming platform?GetInData
 
Zdarzeniowe zasilanie i przeliczanie danych w nowoczesnej hurtowni danych opa...
Zdarzeniowe zasilanie i przeliczanie danych w nowoczesnej hurtowni danych opa...Zdarzeniowe zasilanie i przeliczanie danych w nowoczesnej hurtowni danych opa...
Zdarzeniowe zasilanie i przeliczanie danych w nowoczesnej hurtowni danych opa...GetInData
 
Open-source vs. public cloud in the Big Data landscape. Friends or Foes?
Open-source vs. public cloud in the Big Data landscape. Friends or Foes?Open-source vs. public cloud in the Big Data landscape. Friends or Foes?
Open-source vs. public cloud in the Big Data landscape. Friends or Foes?GetInData
 

More from GetInData (20)

Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
 
Feast + Amundsen Integration - Mariusz Strzelecki, GetInData
Feast + Amundsen Integration - Mariusz Strzelecki, GetInDataFeast + Amundsen Integration - Mariusz Strzelecki, GetInData
Feast + Amundsen Integration - Mariusz Strzelecki, GetInData
 
Kubernetes and real-time analytics - how to connect these two worlds with Apa...
Kubernetes and real-time analytics - how to connect these two worlds with Apa...Kubernetes and real-time analytics - how to connect these two worlds with Apa...
Kubernetes and real-time analytics - how to connect these two worlds with Apa...
 
Big data trends - Krzysztof Zarzycki, GetInData
Big data trends - Krzysztof Zarzycki, GetInDataBig data trends - Krzysztof Zarzycki, GetInData
Big data trends - Krzysztof Zarzycki, GetInData
 
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
 
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...
 
Monitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInDataMonitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInData
 
Complex event processing platform handling millions of users - Krzysztof Zarz...
Complex event processing platform handling millions of users - Krzysztof Zarz...Complex event processing platform handling millions of users - Krzysztof Zarz...
Complex event processing platform handling millions of users - Krzysztof Zarz...
 
Predicting Startup Market Trends based on the news and social media - Albert ...
Predicting Startup Market Trends based on the news and social media - Albert ...Predicting Startup Market Trends based on the news and social media - Albert ...
Predicting Startup Market Trends based on the news and social media - Albert ...
 
Managing Big Data projects in a constantly changing environment - Rafał Zalew...
Managing Big Data projects in a constantly changing environment - Rafał Zalew...Managing Big Data projects in a constantly changing environment - Rafał Zalew...
Managing Big Data projects in a constantly changing environment - Rafał Zalew...
 
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...
 
Strategies for on premise to Google Cloud migration - Mateusz Pytel, GetInData
Strategies for on premise to Google Cloud migration - Mateusz Pytel, GetInDataStrategies for on premise to Google Cloud migration - Mateusz Pytel, GetInData
Strategies for on premise to Google Cloud migration - Mateusz Pytel, GetInData
 
Monitoring environment based on satellite data with Python and PySpark - Albe...
Monitoring environment based on satellite data with Python and PySpark - Albe...Monitoring environment based on satellite data with Python and PySpark - Albe...
Monitoring environment based on satellite data with Python and PySpark - Albe...
 
Welcome to MLOps candy shop and choose your flavour! - Mateusz Pytel & Marius...
Welcome to MLOps candy shop and choose your flavour! - Mateusz Pytel & Marius...Welcome to MLOps candy shop and choose your flavour! - Mateusz Pytel & Marius...
Welcome to MLOps candy shop and choose your flavour! - Mateusz Pytel & Marius...
 
Real time analytics that controls 50% of mobile network in Poland - Maciej Br...
Real time analytics that controls 50% of mobile network in Poland - Maciej Br...Real time analytics that controls 50% of mobile network in Poland - Maciej Br...
Real time analytics that controls 50% of mobile network in Poland - Maciej Br...
 
How to maximize profit from IoT by using data platform - Albert Lewandowski, ...
How to maximize profit from IoT by using data platform - Albert Lewandowski, ...How to maximize profit from IoT by using data platform - Albert Lewandowski, ...
How to maximize profit from IoT by using data platform - Albert Lewandowski, ...
 
Understanding Big Data Analytics - solutions for growing businesses - Rafał M...
Understanding Big Data Analytics - solutions for growing businesses - Rafał M...Understanding Big Data Analytics - solutions for growing businesses - Rafał M...
Understanding Big Data Analytics - solutions for growing businesses - Rafał M...
 
Hot to build continuously processing for 24/7 real-time data streaming platform?
Hot to build continuously processing for 24/7 real-time data streaming platform?Hot to build continuously processing for 24/7 real-time data streaming platform?
Hot to build continuously processing for 24/7 real-time data streaming platform?
 
Zdarzeniowe zasilanie i przeliczanie danych w nowoczesnej hurtowni danych opa...
Zdarzeniowe zasilanie i przeliczanie danych w nowoczesnej hurtowni danych opa...Zdarzeniowe zasilanie i przeliczanie danych w nowoczesnej hurtowni danych opa...
Zdarzeniowe zasilanie i przeliczanie danych w nowoczesnej hurtowni danych opa...
 
Open-source vs. public cloud in the Big Data landscape. Friends or Foes?
Open-source vs. public cloud in the Big Data landscape. Friends or Foes?Open-source vs. public cloud in the Big Data landscape. Friends or Foes?
Open-source vs. public cloud in the Big Data landscape. Friends or Foes?
 

OpenLineage & Airflow - data lineage has never been easier

  • 1. OpenLineage & Airflow - data lineage has never been easier May 2022
  • 2. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Maciej Paweł
  • 3. © Copyright. All rights reserved. Not to be reproduced without prior written consent. OpenLineage to build a healthy data ecosystem Team A Team C Team B Interesting questions: ● What is the data source? ● What is the schema? ● Who is the owner? ● How often is it updated? ● Where does it come from? ● Who is using it? ● What has changed?
  • 4. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Infer or observe? …or you can capture it when the image is originally created! You can try to infer the date and location of an image after the fact… rocks 26m until sunset haze
  • 5. © Copyright. All rights reserved. Not to be reproduced without prior written consent. OpenLineage mission ● To define an open standard for the collection of lineage metadata from pipelines as they are running. Graph DB Backend Producers OpenLineage Kafka topic HTTP client Consumers Kafka client GraphDB client Kafka client Kafka topic
  • 6. © Copyright. All rights reserved. Not to be reproduced without prior written consent. ● We want to achieve real-time notification about TaskInstance start, success, fail ● First way to do it? Subclassing DAG - from airflow import DAG + from openlineage.airflow import DAG ● We can overload DAG methods and get notifications this way. ● Modify all the dags, have to set up openlineage-airflow locally. The dark past of Airflow Integration
  • 7. © Copyright. All rights reserved. Not to be reproduced without prior written consent. ● We want to achieve real-time notification about TaskInstance start, success, fail ● First way to do it? Subclassing DAG - from airflow import DAG + from openlineage.airflow import DAG ● We can overload DAG methods and get notifications this way. ● Modify all the dags, have to set up openlineage-airflow locally. ● Stopped working in Airflow 2 The dark past of Airflow Integration
  • 8. © Copyright. All rights reserved. Not to be reproduced without prior written consent. ● LineageBackend - sounds like right tool for the job? ● Both Airflow 1.10 and Airflow 2.1+ supported ● You can choose your LineageBackend in Airflow config ● Does not allow us to emit events on task start or failure ● We need those to reliably report what happened! ● Let’s contribute! The closer, slightly dark past
  • 9. © Copyright. All rights reserved. Not to be reproduced without prior written consent. ● LineageBackend - sounds like right tool for the job? ● Both Airflow 1.10 and Airflow 2.1+ supported ● You can choose your LineageBackend in Airflow config ● Does not allow us to emit events on task start or failure ● We need those to reliably report what happened! ● Let’s contribute! ● Turns out it’s not so simple. The closer, slightly dark past
  • 10. © Copyright. All rights reserved. Not to be reproduced without prior written consent. ● Let’s add new interface! ● We want our plugin to be notified when TaskInstanceState changes to RUNNING, SUCCESS, FAILED The great present
  • 11. © Copyright. All rights reserved. Not to be reproduced without prior written consent. ● SqlAlchemy allows us to listen to existing database events ● AirflowPlugin mechanism allows us to automatically load plugin code from external Python packages ● Pluggy allows us to call registered plugins without needing to know what they are The great present
  • 12. © Copyright. All rights reserved. Not to be reproduced without prior written consent. ● Okay ● Is present in Airflow 2.3! The great present
  • 13. © Copyright. All rights reserved. Not to be reproduced without prior written consent. ● Extractors ● Built-in extractors - ○ BigQueryExtractor ○ SnowflakeExtractor ○ PostgresExtractor ○ GreatExpectationsExtractor ○ … ● Possibility to create custom extractors Features
  • 14. © Copyright. All rights reserved. Not to be reproduced without prior written consent. ● Additional common library to help with writing extractors ● SQL parser ● Other integrations (dbt…) can use those features as well Features
  • 15. © Copyright. All rights reserved. Not to be reproduced without prior written consent. ● Prevalence of PythonOperator ● Can we get data directly from Hooks? ● Hooks are very diverse. The shiny future
  • 16. © Copyright. All rights reserved. Not to be reproduced without prior written consent. ● Prevalence of PythonOperator ● Can we get data directly from Hooks? ● Hooks are very diverse. ● AIP-48 solves a lot of those problems The shiny future
  • 17. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Top 3 recent features: ● Support for Spark 3.2.1 ● Extensibility API ○ Possibility to write custom plugins to enrich existing OpenLineage events. ● Column level lineage ○ Which input columns were used to produce output column X? Other: ● Spawning Spark from your Airflow DAG? We’ll keep track of that. ● Lifecycle state change - understand the meaning of DROP, DELETE, ALTER… ● Dataset versions for Iceberg & Delta Apache Spark Integration
  • 18. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Status: ● Under construction. ● We’re already able to: ○ Identify sources & sinks Kafka topics, ○ Fetch datasets’ schemas for Avro, ○ Include checkpoint statistics in OpenLineage events, ○ Retrieve information on Iceberg sources & sinks, ○ … ● Looking forward to publish first experimental version. Apache Flink integration
  • 19. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Status: ● github.com/OpenLineage/OpenLineage You can contribute too!