Marek Novotny, ABSA
Vaclav Kosar, ABSA
Spline: Data Lineage for Spark Structured Streaming
#SAISExp18
About Us
•ABSA is a Pan-African financial services provider
– With Apache Spark at the core of its data engineering
•We try to fill gaps in the Hadoop ecosystem
•Contributions to Apache Spark
•Spark-related open-source projects (github.com/AbsaOSS)
– ABRiS – Avro SerDe for structured APIs (#SAISDev5)
– Cobrix – Cobol data source
– Atum – Completeness and accuracy library
– Spline – Data lineage tracking and visualization tool (#EUent3)
• How is the data calculated?
• What are the schema and format of the streamed data?
Data Flow
[Diagram: Job 1, Job 2 and Job 3 exchanging data through Topic A, Topic B, Topic Z, Topic D and Path /…/abc]
Transformations
[Diagram: the data flow diagram with the details of Job 3 expanded: Topic D, //path/, a Join, the expression colA + colB, and Topic Z]
Schemas and Formats
[Diagram: the data flow diagram annotated with Schema A, Schema B, Schema C, Schema D and Schema Z]
Spline
• To make Spark BCBS (Clarity) compliant
• To communicate with business people
• Online documentation of
– Job dependencies
– Spark SQL job details
– Attributes occurring in the logic
[Screenshots: Dependencies, Details and Attributes views]
Lineage Tracking of Batch Jobs
• Dataset-oriented
• Leverages execution plans
• Structured APIs only
– SQL
– Dataframes
– Datasets
• UDFs and lambdas are treated as black boxes
[Diagram: a Job, Dataset A and Lineage A]
Lineage Tracking of Streaming Jobs
• Structured Streaming only
• Source-oriented (topic)
• Evolves in time
[Diagram: an App and Topic A, with Lineage T1, Lineage T2 and Lineage T3 placed along a Time axis]
Structured Streaming Support
[Diagram: a Spark structured streaming job (Session, Transformations, Query) on top of the Spark libraries; the StreamingQueryManager and the Spline Streaming Listener exchange the messages labeled Start, Give me exec. plans, Execution Plans, Progress and Event details; the listener feeds the Lineage Model and the Spline UI]
• StreamingQueryManager
– Information about start
– Can provide execution plans
– Information about progress
• MicroBatch
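The start and progress notifications above can be sketched as a streaming query listener. This is a minimal illustration, not Spline's implementation: the talk targets the Scala API, while this sketch assumes the Python `StreamingQueryListener` (available in PySpark 3.4 and later) and falls back to a stub class when PySpark is not installed. The class name `LineageEventCollector` and the fake events are hypothetical.

```python
from types import SimpleNamespace

try:
    # Available in PySpark 3.4+; assumed here, the talk itself used Scala.
    from pyspark.sql.streaming import StreamingQueryListener
except ImportError:
    # Stub so the sketch also runs without Spark installed.
    class StreamingQueryListener:
        pass


class LineageEventCollector(StreamingQueryListener):
    """Hypothetical listener that records start and progress events."""

    def __init__(self):
        self.started = []   # (query id, query name)
        self.progress = []  # one dict per micro-batch

    def onQueryStarted(self, event):
        self.started.append((str(event.id), event.name))

    def onQueryProgress(self, event):
        p = event.progress
        self.progress.append({
            "query_id": str(p.id),
            "timestamp": p.timestamp,
            # Sources report offsets; the sink only reports a description.
            "sources": [(s.description, s.startOffset, s.endOffset)
                        for s in p.sources],
            "sink": p.sink.description,
        })

    def onQueryTerminated(self, event):
        pass


# Fake events standing in for what Spark would deliver (made-up values).
collector = LineageEventCollector()
collector.onQueryStarted(SimpleNamespace(id="q1", name="Job R"))
collector.onQueryProgress(SimpleNamespace(progress=SimpleNamespace(
    id="q1",
    timestamp="2018-09-24T10:21:00.000Z",
    sources=[SimpleNamespace(description="Topic A",
                             startOffset="3", endOffset="5")],
    sink=SimpleNamespace(description="Topic Z"),
)))
```

With a real session, such a listener would be registered through `spark.streams.addListener(LineageEventCollector())`. Collecting the execution plans is not covered by this sketch.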
Interval View
• Displays the data flow within a fixed time interval
[Diagram: timeline from Start to End with progress events of Job W1, Job R and Job W2 and a selected Interval; the resulting view shows Job W1, Job R and Job W2 connected through S1, S2 and S3]
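The idea of the Interval View can be sketched as a filter over progress events: only events whose timestamp falls inside the selected interval contribute edges to the displayed graph. This is a simplified model under stated assumptions, not Spline's code; the `Progress` record, the `interval_view` function and all event values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Progress:
    job: str
    time: datetime
    reads: tuple   # names of the sources read in this micro-batch
    writes: str    # name of the sink written in this micro-batch


def interval_view(events, start, end):
    """Return the (from, to) edges of progress events within [start, end]."""
    edges = set()
    for e in events:
        if start <= e.time <= end:
            edges.update((src, e.job) for src in e.reads)
            edges.add((e.job, e.writes))
    return edges


# Made-up progress events for three jobs.
day = (2018, 9, 24)
events = [
    Progress("Job W1", datetime(*day, 10, 22), (), "S2"),
    Progress("Job R", datetime(*day, 10, 24), ("S2", "S3"), "S1"),
    Progress("Job W2", datetime(*day, 10, 47), (), "S3"),
]

edges = interval_view(events, datetime(*day, 10, 21), datetime(*day, 10, 25))
```

Here Job W2 is left out of the view because its progress event falls outside the selected interval, whether or not Job R actually consumed its data.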
Demo – Use Case
What is the temperature per hour in Prague?
[Diagram: Station 1, Station 2, …, Station N and the unknown (?) value for Prague]
Demo – Use Case Output
[Chart: Temperature [°C] (axis from 0 to 35) by Hours (0 to 23) for 2018-09-24, with Start and End markers]
Demo – Select Interval View
Demo – Select Interval
Demo – Select Sink
Demo – Find Highlighted Sink
Demo – Review The Lineage
Demo – Change The Interval
Demo – Observe New Lineage
Demo – Select A Job
Demo – Drill Down
Demo – Review Job Details
Demo – Select An Operation
Demo – See Operation Attributes
Interval View Limitations
• Edge case (delayed read, early write)
– Job W1 should be linked
– Job W2 should not be linked
[Diagram: timeline from Start to End with progress events of Job W1, Job R and Job W2, a selected Interval, and the time labels 10:21, 10:25, 10:30, 10:35, 10:45 and 10:51; the graph shown by the Interval View is compared with the actual Lineage of Job W1, Job W2 and Job R over S1, S2 and S3]
Beyond The Interval View
• Use the addresses of rows instead of timestamps
• Structured Streaming (SS) reports addresses (offsets) for each source, but not for sinks
• Most sinks are also sources and thus could return offsets
[Diagram: a Progress Event of a Job that reads Source 2 (offsets 3 - 5) and Source 3 (offsets 12 - 14) and writes to a Sink whose offsets are unknown (?)]
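The asymmetry between sources and sinks can be seen in the shape of a progress event. The sketch below parses a hand-written event modeled on the JSON form of Spark's `StreamingQueryProgress`, where each source carries `startOffset` and `endOffset` and the sink carries only a description. The topic names, descriptions and offsets are made up (the ranges mirror the diagram), and the Kafka-style `{topic: {partition: offset}}` layout is an assumption about the source type.

```python
import json

# Hypothetical progress event shaped like Spark's StreamingQueryProgress JSON.
progress_json = """
{
  "name": "job-r",
  "timestamp": "2018-09-24T10:30:00.000Z",
  "sources": [
    {"description": "KafkaSource[Subscribe[source2]]",
     "startOffset": {"source2": {"0": 3}},
     "endOffset": {"source2": {"0": 5}}},
    {"description": "KafkaSource[Subscribe[source3]]",
     "startOffset": {"source3": {"0": 12}},
     "endOffset": {"source3": {"0": 14}}}
  ],
  "sink": {"description": "KafkaSink"}
}
"""


def source_offset_ranges(progress):
    """Map (topic, partition) to (start, end) for every source in the event.

    Assumes startOffset is present; a real first batch may report null.
    """
    ranges = {}
    for source in progress["sources"]:
        for topic, partitions in source["endOffset"].items():
            for partition, end in partitions.items():
                start = source["startOffset"][topic][partition]
                ranges[(topic, int(partition))] = (start, end)
    return ranges


progress = json.loads(progress_json)
ranges = source_offset_ranges(progress)
# The sink entry has no offsets to read, so this lookup yields None.
sink_offsets = progress["sink"].get("endOffset")
```

The source ranges can be recovered per topic and partition, while nothing comparable is available for the sink, which is the gap the slide points out.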
Offset-Based Linking
• Jobs are linked when their progress offsets overlap
• The timestamp of an offset doesn’t matter
[Diagram: offset axes for S1, S2 and S3 with the progress events of Job R, Job W1, Job W2 and Job X and a Selected range; the resulting lineage connects Job W1, Job W2 and Job R through S1, S2 and S3; offset labels 3 - 5, 12 - 14, 9 - 19 and 22 - 27]
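The linking rule can be sketched as an interval-overlap test on offsets. This is an illustration under assumptions rather than an existing Spline feature: it presumes that sinks also report the offsets they wrote, which the talk describes as requiring changes to Spark. The `OffsetRange` type, the helper functions and the assignment of ranges to jobs are hypothetical; only some of the numbers are borrowed from the diagram.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffsetRange:
    job: str
    stream: str
    start: int  # first offset, inclusive
    end: int    # last offset, inclusive


def overlaps(a, b):
    """Two ranges overlap when they share at least one offset of one stream."""
    return a.stream == b.stream and a.start <= b.end and b.start <= a.end


def linked_writers(read, writes):
    """Return the jobs whose written offsets overlap the offsets read."""
    return sorted({w.job for w in writes if overlaps(read, w)})


# Illustrative values; which job owns which range is an assumption.
reads = [
    OffsetRange("Job R", "S2", 3, 5),
    OffsetRange("Job R", "S3", 12, 14),
]
writes = [
    OffsetRange("Job W1", "S2", 0, 4),
    OffsetRange("Job W2", "S3", 9, 19),
    OffsetRange("Job X", "S3", 22, 27),
]

links = {r.stream: linked_writers(r, writes) for r in reads}
```

No timestamp appears anywhere in the comparison, so a delayed read or an early write does not change the result. In this example Job X is not linked because its range on S3 does not intersect the offsets Job R consumed.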
Conclusion
• Spline: data lineage tracking tool
• New support for Structured Streaming
• Demo POC: Interval View
• Proposed generalization: offset-based linking
Future Plans
• Release Interval View in Spline
• After changes to Spark:
– Offset-based linking for micro-batch streaming
– Continuous streaming support
• Support for dataset checkpoints
Questions
• Now is a good time
• Or feel free to contact us
– Marek Novotny
• mn.mikke@gmail.com
– Vaclav Kosar
• admin@vaclavkosar.com
• github.com/AbsaOSS/spline
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
 
Data structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdfData structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdf
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 

Spline: Data Lineage For Spark Structured Streaming

  • 1. Spline: Data Lineage for Spark Structured Streaming (#SAISExp18)
      Marek Novotny, ABSA; Vaclav Kosar, ABSA
  • 2. About Us
      • ABSA is a Pan-African financial services provider
          – With Apache Spark at the core of its data engineering
      • We try to fill gaps in the Hadoop ecosystem
      • Contributions to Apache Spark
      • Spark-related open-source projects (github.com/AbsaOSS)
          – ABRiS: Avro SerDe for structured APIs (#SAISDev5)
          – Cobrix: COBOL data source
          – Atum: completeness and accuracy library
          – Spline: data lineage tracking and visualization tool (#EUent3)
  • 3.
      • How is the data calculated?
      • What is the schema and format of the streamed data?
  • 4. Data Flow
      [Diagram showing Job 1, Job 2, Job 3, Topic A, Topic B, Topic Z, Topic D, and Path /…/abc]
  • 5. Transformations
      [Diagram: the data flow from slide 4 with Job 3 Details shown; labels: Topic D, //path/, Join, colA + colB, Topic Z]
  • 6. Schemas and Formats
      [Diagram: the data flow from slide 4 with labels Schema A, Schema B, Schema C, Schema D, and Schema Z]
  • 7. Spline
      • To make Spark BCBS (Clarity) compliant
      • To communicate with business people
  • 8. Spline
      • To make Spark BCBS (Clarity) compliant
      • To communicate with business people
      • Online documentation of
          – Job dependencies
      [Figure label: Dependencies]
  • 9. Spline
      • To make Spark BCBS (Clarity) compliant
      • To communicate with business people
      • Online documentation of
          – Job dependencies
          – Particular Spark SQL jobs
      [Figure labels: Dependencies, Details]
  • 10. Spline
      • To make Spark BCBS (Clarity) compliant
      • To communicate with business people
      • Online documentation of
          – Job dependencies
          – Spark SQL job details
          – Attributes occurring in the logic
      [Figure labels: Dependencies, Details, Attributes]
  • 11. Lineage Tracking of Batch Jobs
      • Dataset-oriented
      • Leverages execution plans
      • Structured APIs only
          – SQL
          – DataFrames
          – Datasets
      • UDFs and lambdas are treated as black boxes
      [Diagram labels: Job, Dataset A, Lineage A]
  • 12. Lineage Tracking of Streaming Jobs
      • Structured Streaming only
      • Source-oriented (topic)
      • Evolves over time
      [Diagram labels: App, Topic A, Time, Lineage T1, Lineage T2, Lineage T3]
  • 13. Structured Streaming Support
      • StreamingQueryManager
      [Diagram labels: Spark structured streaming job (Spark libraries, Transformations, Session, Query), StreamingQueryManager]
  • 14. Structured Streaming Support
      • StreamingQueryManager
          – Information about start
      [Diagram labels added: Spline Streaming Listener, Start]
  • 15. Structured Streaming Support
      • StreamingQueryManager
          – Information about start
          – Can provide execution plans
      [Diagram labels added: "Give me exec. plans"]
  • 16. Structured Streaming Support
      • StreamingQueryManager
          – Information about start
          – Can provide execution plans
      [Diagram labels added: Execution Plans, Lineage Model, Spline UI]
  • 17. Structured Streaming Support
      • StreamingQueryManager
          – Information about start
          – Can provide execution plans
          – Information about progress
              • MicroBatch
      [Diagram labels added: Progress, Event details]
  • 18. Interval View
      • Displays the data flow within a fixed time interval
      [Diagram labels: Time axis with Start and End marking the Interval; progress events of Job W1, Job R, and Job W2; resulting view with Job W1, Job R, Job W2, S1, S2, S3]
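The idea of the Interval View can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Spline's implementation: the `ProgressEvent` record, the `interval_view` function, the `hhmm` helper, and the sample topology (which job reads or writes which topic) are all hypothetical. The sketch keeps every progress event whose timestamp falls inside the chosen interval and turns it into data-flow edges.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ProgressEvent:
    """Hypothetical record of one micro-batch progress event."""
    job: str
    minute: int                 # event time as minutes since midnight
    reads: FrozenSet[str]       # sources read in this micro-batch
    writes: Optional[str]       # sink written in this micro-batch, if any


def hhmm(text):
    """Convert 'HH:MM' to minutes since midnight."""
    hours, minutes = text.split(":")
    return int(hours) * 60 + int(minutes)


def interval_view(events, start, end):
    """Return data-flow edges (from, to) for events inside [start, end]."""
    edges = set()
    for ev in events:
        if start <= ev.minute <= end:
            edges.update((src, ev.job) for src in ev.reads)
            if ev.writes is not None:
                edges.add((ev.job, ev.writes))
    return edges


# Assumed sample data; the topology is made up for illustration.
EVENTS = [
    ProgressEvent("Job W1", hhmm("10:25"), frozenset({"Topic A"}), "S1"),
    ProgressEvent("Job R", hhmm("10:30"), frozenset({"S1", "S2"}), "Topic Z"),
    ProgressEvent("Job W2", hhmm("10:51"), frozenset({"Topic B"}), "S3"),
]

VIEW = interval_view(EVENTS, hhmm("10:21"), hhmm("10:35"))
```

With the interval 10:21 to 10:35, Job W2 is left out of `VIEW` because its only progress event happened at 10:51. The link is decided purely by timestamps, which is the property the later limitation slides examine.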
  • 19. Demo – Use Case
      • What is the temperature per hour in Prague?
      [Diagram labels: Station 1, Station 2, …, Station N, Prague]
  • 20. Demo – Use Case Output
      [Chart: Temperature [°C] (axis 0 to 35) by Hours (0 to 23) for 2018-09-24, with Start and End markers]
  • 21. Demo – Select Interval View
  • 22. Demo – Select Interval
  • 23. Demo – Select Sink
  • 24. Demo – Find Highlighted Sink
  • 25. Demo – Review The Lineage
  • 26. Demo – Change The Interval
  • 27. Demo – Observe New Lineage
  • 28. Demo – Select A Job
  • 29. Demo – Drill Down
  • 30. Demo – Review Job Details
  • 31. Demo – Select An Operation
  • 32. Demo – See Operation Attributes
  • 33. Interval View Limitations
      [Diagram labels: Time axis with Start and End marking the Interval; progress events of Job W1, Job R, and Job W2; timestamps 10:21, 10:25, 10:30, 10:35, 10:45, 10:51; Job R, S1, S2, S3]
  • 34. Interval View Limitations
      [Diagram labels: same timeline as slide 33; Interval View with Job W1, Job R, S1, S2]
  • 35. Interval View Limitations
      • Edge case (delayed read, early write)
          – Job W1 should be linked
          – Job W2 should not be linked
      [Diagram labels: same timeline as slide 33; Interval View; Lineage; Job W1, Job W2, Job R, S1, S2, S3]
  • 36. Beyond The Interval View
      • Use row addresses instead of timestamps
      • Structured Streaming has addresses (offsets) on each source, but not on sinks
      • Most sinks are also sources and thus could return offsets
      [Diagram labels: Job, Progress Event; Source 2, offsets 3 - 5; Source 3, offsets 12 - 14; Sink, offsets: ?]
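A rough sketch of what this asymmetry looks like when reading a progress event, assuming the event is available as a plain dictionary. The field names `sources`, `sink`, `description`, `startOffset`, and `endOffset` follow the JSON form of Structured Streaming progress reports; the helper names and the sample values are hypothetical, and the single integer per offset is a simplification (real sources such as Kafka report offsets per partition).

```python
def source_offset_ranges(progress):
    """Map each source description to its (start, end) offsets."""
    ranges = {}
    for source in progress.get("sources", []):
        ranges[source["description"]] = (
            source["startOffset"],
            source["endOffset"],
        )
    return ranges


def sink_offset_range(progress):
    """Return the sink's (start, end) offsets, or None when not reported."""
    sink = progress.get("sink", {})
    if "startOffset" in sink and "endOffset" in sink:
        return (sink["startOffset"], sink["endOffset"])
    return None


# Assumed sample event mirroring the numbers on the slide.
PROGRESS = {
    "sources": [
        {"description": "Source 2", "startOffset": 3, "endOffset": 5},
        {"description": "Source 3", "startOffset": 12, "endOffset": 14},
    ],
    "sink": {"description": "Sink"},
}

SOURCE_RANGES = source_offset_ranges(PROGRESS)
SINK_RANGE = sink_offset_range(PROGRESS)
```

Here `SINK_RANGE` is `None`: the sources carry offset ranges, while the sink entry has nothing to link on. That missing piece is what the slide marks with "?".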
  • 38. Offset-Based Linking
      [Diagram labels: Job R Progress with offsets on S1, S2, S3; Job R, S1; Selected]
  • 39. Offset-Based Linking
      [Diagram labels: Job R Progress with offsets 3 - 5 and 12 - 14 on S1, S2, S3; Job R, S1, S2, S3; Selected]
  • 40. Offset-Based Linking
      [Diagram labels: Job R Progress, Job W2 Progress; offsets 3 - 5, 9 - 19, 12 - 14; Job R, Job W2, S1, S2, S3; Selected]
  • 41. Offset-Based Linking
      [Diagram labels: Job R Progress, Job W1 Progress, Job W2 Progress, Job X Progress; offsets 3 - 5, 9 - 19, 12 - 14, 22 - 27; Job W1, Job R, Job W2, S1, S2, S3; Selected]
  • 42. Offset-Based Linking
      • Jobs are linked when their progress offsets overlap
      • The offset timestamp doesn't matter
      [Diagram labels: Job R Progress, Job W1 Progress, Job W2 Progress, Job X Progress; offsets 3 - 5, 9 - 19, 12 - 14, 22 - 27; Job W1, Job R, Job W2, S1, S2, S3; Selected]
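The overlap rule can be illustrated with a small Python sketch. It is a simplified model, not the proposed Spark or Spline change: `overlaps` and `link_writers` are hypothetical helpers, offsets are treated as inclusive integer ranges, and the assignment of the slide's numbers to particular topics and jobs (including the range used for Job X) is an assumption made for the example.

```python
def overlaps(a, b):
    """True when two inclusive offset ranges share at least one offset."""
    return a[0] <= b[1] and b[0] <= a[1]


def link_writers(read_ranges, write_events):
    """Return the jobs whose written offsets overlap what the reader consumed.

    read_ranges:  {topic: (first, last)} from the selected reader progress
    write_events: iterable of (job, topic, (first, last)) from writer progress
    """
    linked = set()
    for job, topic, written in write_events:
        consumed = read_ranges.get(topic)
        if consumed is not None and overlaps(consumed, written):
            linked.add(job)
    return linked


# Assumed mapping of the slide's offset ranges to topics and jobs.
JOB_R_READS = {"S1": (22, 27), "S2": (3, 5), "S3": (12, 14)}
WRITES = [
    ("Job W1", "S1", (22, 27)),
    ("Job W2", "S3", (9, 19)),
    ("Job X", "S2", (40, 45)),   # made-up range with no overlap
]

LINKED = link_writers(JOB_R_READS, WRITES)
```

No timestamps appear anywhere in the sketch. A writer is linked to the selected reader progress only because the rows it wrote are among the rows that were read, regardless of when either event happened.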
  • 43. Conclusion
      • Spline: data lineage tracking tool
      • New support for Structured Streaming
      • Demo POC: Interval View
      • Proposed generalization: offset-based linking
  • 44. Future Plans
      • Release Interval View in Spline
      • After changes to Spark:
          – Offset-based linking for micro-batch streaming
          – Continuous streaming support
      • Support for dataset checkpoints
  • 45. Questions
      • Now is a good time
      • Or feel free to contact us
          – Marek Novotny: mn.mikke@gmail.com
          – Vaclav Kosar: admin@vaclavkosar.com
      • github.com/AbsaOSS/spline