Marek Novotny, ABSA
Vaclav Kosar, ABSA
Spline: Data Lineage for
Spark Structured Streaming
#SAISExp18
About Us
• ABSA is a Pan-African financial services provider
– With Apache Spark at the core of its data engineering
• We try to fill gaps in the Hadoop ecosystem
• Contributions to Apache Spark
• Spark-related open-source projects (github.com/AbsaOSS)
– ABRiS – Avro SerDe for structured APIs (#SAISDev5)
– Cobrix – COBOL data source
– Atum – Completeness and accuracy library
– Spline – Data lineage tracking and visualization tool (#EUent3)
• How is the data calculated?
• What are the schema and format of the streamed data?
Data Flow
[Diagram: data flow among Job 1, Job 2, Job 3, Topic A, Topic B, Topic Z, Topic D, and Path /…/abc]
Transformations
[Diagram: data flow among Job 1, Job 2, Job 3, Topic A, Topic B, Topic Z, Topic D, and Path /…/abc, with a "Job 3 Details" panel listing Topic D, //path/, Join, colA + colB, and Topic Z]
Schemas and Formats
[Diagram: data flow among Job 1, Job 2, Job 3, Topic A, Topic B, Topic Z, Topic D, and Path /…/abc, annotated with Schema A, Schema B, Schema C, Schema D, and Schema Z]
Spline
• To make Spark BCBS (Clarity) compliant
• To communicate with business people
• Online documentation of
– Job dependencies
– Spark SQL job details
– Attributes occurring in the logic
[Screenshots: Dependencies, Details, Attributes]
Lineage Tracking of Batch Jobs
• Dataset-oriented
• Leverages execution plans
• Structured APIs only
– SQL
– DataFrames
– Datasets
• UDFs and lambdas are treated as black boxes
[Diagram: Job, Dataset A, Lineage A]
Lineage Tracking of Streaming Jobs
• Structured Streaming only
• Source-oriented (topic)
• Evolves in time
[Diagram: App, Topic A, and a Time axis with Lineage T1, Lineage T2, and Lineage T3]
Structured Streaming Support
• StreamingQueryManager
– Information about start
– Can provide execution plans
– Information about progress
• MicroBatch
[Diagram: a Spark structured streaming job (Session, Transformations, Query) on top of the Spark libraries; the StreamingQueryManager passes Start and Progress events with event details to the Spline Streaming Listener; the listener asks for execution plans ("Give me exec. plans"); the Execution Plans go into the Lineage Model and the Spline UI]
Interval View
• Displays data flow in a fixed interval
[Diagram: a Time axis with Start and End marking the Interval and progress events of Job W1, Job R, and Job W2; the resulting view contains Job W1, Job R, Job W2, S1, S2, and S3]
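The Interval View idea can be sketched in a few lines of Python. This is only an illustration, not Spline's implementation: the `ProgressEvent` class, the `interval_view` function, the timestamps, the upstream names `Topic A` and `Topic B`, and the assumed topology (Job W1 writes S2, Job W2 writes S3, Job R reads S2 and S3 and writes S1) are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ProgressEvent:
    job: str             # streaming job that reported the micro-batch
    timestamp: datetime  # when the progress event was reported
    sources: tuple       # names of the sources read by the micro-batch
    sink: str            # name of the sink written by the micro-batch


def interval_view(events, start, end):
    """Collect data-flow edges from progress events reported within [start, end]."""
    edges = set()
    for event in events:
        if start <= event.timestamp <= end:
            for source in event.sources:
                edges.add((source, event.job))
            edges.add((event.job, event.sink))
    return edges


def at(hour, minute):
    """Build a hypothetical timestamp on the demo date."""
    return datetime(2018, 9, 24, hour, minute)


# Assumed topology: W1 writes S2, W2 writes S3, R reads S2 and S3 and writes S1.
events = [
    ProgressEvent("Job W1", at(10, 5), ("Topic A",), "S2"),
    ProgressEvent("Job R", at(10, 20), ("S2", "S3"), "S1"),
    ProgressEvent("Job W2", at(10, 25), ("Topic B",), "S3"),
    ProgressEvent("Job W1", at(11, 40), ("Topic A",), "S2"),
]

edges = interval_view(events, start=at(10, 0), end=at(10, 30))
for edge in sorted(edges):
    print(edge)
```

Because the selection relies only on when a progress event was reported, a job is shown whenever it was active in the interval, whether or not the rows it handled are the ones another job processed.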
Demo – Use Case
What is the temperature per hour in Prague?
[Diagram: Station 1, Station 2, …, Station N and Prague]
Demo – Use Case Output
[Chart: Temperature [°C] (axis 0 to 35) by Hours (axis 0 to 23) for 2018-09-24, with Start and End markers]
Demo – Select Interval View
Demo – Select Interval
Demo – Select Sink
Demo – Find Highlighted Sink
Demo – Review The Lineage
Demo – Change The Interval
Demo – Observe New Lineage
Demo – Select A Job
Demo – Drill Down
Demo – Review Job Details
Demo – Select An Operation
Demo – See Operation Attributes
Interval View Limitations
• Edge case (delayed read, early write)
– Job W1 should be linked
– Job W2 should not be linked
[Diagram: a Time axis with Start and End marking the Interval, progress events of Job W1, Job R, and Job W2, and the times 10:21, 10:25, 10:30, 10:35, 10:45, and 10:51; two graphs labeled Interval View and Lineage, one containing Job W1, Job R, S1, and S2, the other containing Job W2, Job R, S1, and S3]
Beyond The Interval View
• Instead of timestamps, use addresses of rows
• Structured Streaming (SS) has addresses (offsets) on each source, but not on sinks
• Most sinks are also sources and thus could return offsets
[Diagram: a Progress Event of a Job with Source 2 (offsets 3 - 5), Source 3 (offsets 12 - 14), and a Sink (offsets: ?)]
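To make the asymmetry between sources and sinks concrete, here is a minimal Python sketch that reads per-source offset ranges out of a progress event. The sample event is hand-written and all of its values are made up; its layout is assumed to follow the JSON that Spark's `StreamingQueryProgress` reports for Kafka sources (a `sources` array with `startOffset` and `endOffset` keyed by topic and partition, and a `sink` object carrying only a description). The function name `source_offset_ranges` and the topic names are hypothetical.

```python
import json

# Hand-written sample event (all values made up). The field layout is an
# assumption based on the JSON form of Spark's StreamingQueryProgress.
PROGRESS_JSON = """
{
  "name": "job-r",
  "timestamp": "2018-09-24T10:30:00.000Z",
  "batchId": 7,
  "sources": [
    {"description": "KafkaSource[Subscribe[source-2]]",
     "startOffset": {"source-2": {"0": 3}},
     "endOffset": {"source-2": {"0": 5}}},
    {"description": "KafkaSource[Subscribe[source-3]]",
     "startOffset": {"source-3": {"0": 12}},
     "endOffset": {"source-3": {"0": 14}}}
  ],
  "sink": {"description": "hypothetical-sink"}
}
"""


def source_offset_ranges(progress):
    """Map (topic, partition) to (start, end) for every source in the event."""
    ranges = {}
    for source in progress["sources"]:
        start = source.get("startOffset") or {}  # may be null on the first batch
        end = source.get("endOffset") or {}
        for topic, partitions in end.items():
            for partition, end_offset in partitions.items():
                start_offset = start.get(topic, {}).get(partition)
                ranges[(topic, int(partition))] = (start_offset, end_offset)
    return ranges


progress = json.loads(PROGRESS_JSON)
ranges = source_offset_ranges(progress)
for key in sorted(ranges):
    print(key, ranges[key])

# The sink entry carries no offsets, which is the gap the slide points out.
print(sorted(progress["sink"].keys()))
```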
Offset-Based Linking
• Jobs are linked when progress offsets overlap
• The timestamp of an offset doesn’t matter
[Diagram: offset axes for S1, S2, and S3 with a selected range on S1 and progress events of Job R, Job W1, Job W2, and Job X; the resulting graph contains Job W1, Job R, Job W2, S1, S2, and S3, labeled with the offset ranges 3 - 5, 12 - 14, 9 - 19, and 22 - 27]
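The overlap rule can be sketched as follows. This is an illustration under stated assumptions, not Spline's code: `OffsetRange`, `overlaps`, and `linked_writers` are hypothetical names, offset ranges are treated as inclusive, and sinks are assumed to report offsets (which the talk proposes but Spark did not provide at the time). The reads of Job R (3 - 5 on S2, 12 - 14 on S3) and the write of Job W2 (9 - 19 on S3) follow one reading of the slide diagram; the ranges for Job W1 and Job X are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffsetRange:
    job: str     # job that reported the progress event
    stream: str  # source or sink the offsets belong to
    start: int   # first offset, inclusive (assumption)
    end: int     # last offset, inclusive (assumption)


def overlaps(a, b):
    """Two ranges overlap when they address the same stream and share an offset."""
    return a.stream == b.stream and a.start <= b.end and b.start <= a.end


def linked_writers(read_ranges, write_ranges):
    """Names of writer jobs whose written offsets overlap any of the read offsets."""
    return sorted({
        write.job
        for write in write_ranges
        for read in read_ranges
        if overlaps(read, write)
    })


# Offsets read by Job R, as read from the slide diagram.
reads = [
    OffsetRange("Job R", "S2", 3, 5),
    OffsetRange("Job R", "S3", 12, 14),
]

# Offsets written by other jobs. Only the Job W2 range comes from the slide;
# the Job W1 and Job X ranges are hypothetical.
writes = [
    OffsetRange("Job W2", "S3", 9, 19),
    OffsetRange("Job W1", "S2", 1, 6),
    OffsetRange("Job X", "S2", 40, 45),
]

print(linked_writers(reads, writes))
```

No timestamp appears anywhere in the comparison: a writer is linked only if the rows it produced are among the rows the reader consumed, regardless of when either progress event was reported.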
Conclusion
• Spline: data lineage tracking tool
• New support for Structured Streaming
• Demo POC: Interval View
• Proposed generalization: offset-based linking
Future Plans
• Release Interval View in Spline
• After changes to Spark:
– Offset-based linking for micro-batch streaming
– Continuous streaming support
• Support for dataset checkpoints
Questions
• Now is a good time
• Or feel free to contact us
– Marek Novotny
• mn.mikke@gmail.com
– Vaclav Kosar
• admin@vaclavkosar.com
• github.com/AbsaOSS/spline