Spline: Data Lineage for Spark Structured Streaming
Marek Novotny, ABSA
Vaclav Kosar, ABSA
#SAISExp18
About Us
• ABSA is a Pan-African financial services provider
– With Apache Spark at the core of its data engineering
• We try to fill gaps in the Hadoop ecosystem
• Contributions to Apache Spark
• Spark-related open-source projects (github.com/AbsaOSS)
– ABRiS – Avro SerDe for structured APIs (#SAISDev5)
– Cobrix – COBOL data source
– Atum – Completeness and accuracy library
– Spline – Data lineage tracking and visualization tool (#EUent3)
• How is the data calculated?
• What is the schema and format of the streamed data?
Data Flow
[Diagram: Job 1, Job 2 and Job 3 connected through Topic A, Topic B, Topic Z, Topic D and Path /…/abc]
Transformations
[Diagram: data flow of Job 1, Job 2 and Job 3 over Topic A, Topic B, Topic Z, Topic D and Path /…/abc, with a Job 3 Details panel showing Topic D, //path/, Join, colA + colB and Topic Z]
Schemas and Formats
[Diagram: data flow of Job 1, Job 2 and Job 3 over Topic A, Topic B, Topic Z, Topic D and Path /…/abc, annotated with Schema A, Schema B, Schema C, Schema D and Schema Z]
Spline
• To make Spark BCBS (Clarity) compliant
• To communicate with business people
• Online documentation of
– Job dependencies
– Spark SQL job details
– Attributes occurring in the logic
[Screenshots: Spline UI views labeled Dependencies, Details and Attributes]
Lineage Tracking of Batch Jobs
• Dataset-oriented
• Leverages execution plans
• Structured APIs only
– SQL
– Dataframes
– Datasets
• UDFs and lambdas are treated as black boxes
[Diagram: Job, Dataset A, Lineage A]
Lineage Tracking of Streaming Jobs
• Structured Streaming only
• Source-oriented (topic)
• Evolves in time
[Diagram: App and Topic A, with Lineage T1, Lineage T2 and Lineage T3 along a Time axis]
Structured Streaming Support
• StreamingQueryManager
– Information about start
– Can provide execution plans
– Information about progress
• MicroBatch
[Diagram: a Spark structured streaming job (Transformations, Session, Query) on top of the Spark libraries; the StreamingQueryManager sends Start and Progress events (event details) to the Spline Streaming Listener, which requests the execution plans ("Give me exec. plans") and passes them to the Lineage Model and the Spline UI]
Interval View
• Displays the data flow within a fixed time interval
[Diagram: a Time axis from Start to End with progress events of Job W1, Job R and Job W2; the selected Interval yields a view with Job W1, Job R, Job W2, S1, S2 and S3]
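The Interval View described above can be sketched in a few lines of Python. This is a minimal illustration, not Spline's implementation: the `Progress` record, the `interval_view` function, the `at` helper and all timestamps are hypothetical, and each progress event is assumed to carry the job name, the sources it read and the sink it wrote.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Progress:
    """One progress event of a streaming job (hypothetical, simplified model)."""
    job: str
    timestamp: datetime
    sources: tuple  # names of the endpoints the job read from
    sink: str       # name of the endpoint the job wrote to


def interval_view(events, start, end):
    """Collect the data-flow edges of all progress events in [start, end)."""
    edges = set()
    for event in events:
        if start <= event.timestamp < end:
            edges.update((source, event.job) for source in event.sources)
            edges.add((event.job, event.sink))
    return edges


def at(hour, minute):
    """Build a timestamp on the demo day (illustrative values only)."""
    return datetime(2018, 9, 24, hour, minute)


# Assumed events: the sources of the writer jobs are left out for brevity.
events = [
    Progress("Job W1", at(10, 21), (), "S2"),
    Progress("Job R", at(10, 30), ("S2", "S3"), "S1"),
    Progress("Job W2", at(10, 45), (), "S3"),
    Progress("Job W2", at(11, 30), (), "S3"),  # outside the selected interval
]

for edge in sorted(interval_view(events, at(10, 0), at(11, 0))):
    print(edge)
```

Narrowing the interval to 10:25 to 10:40 would keep only the edges of Job R, which shows how strongly the displayed lineage depends on the chosen interval.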
Demo – Use Case
What is the temperature per hour in Prague?
[Diagram: Station 1, Station 2, … Station N → ? → Prague]
Demo – Use Case Output
[Chart: Temperature [°C] (axis 0 to 35) by Hours (0 to 23) for 2018-09-24, with Start and End markers]
Demo – Select Interval View
Demo – Select Interval
Demo – Select Sink
Demo – Find Highlighted Sink
Demo – Review The Lineage
Demo – Change The Interval
Demo – Observe New Lineage
Demo – Select A Job
Demo – Drill Down
Demo – Review Job Details
Demo – Select An Operation
Demo – See Operation Attributes
Interval View Limitations
• Edge case (delayed read, early write)
– Job W1 should be linked
– Job W2 should not be linked
[Diagram: a Time axis from Start to End with progress events of Job W1, Job R and Job W2 and the timestamps 10:21, 10:25, 10:30, 10:35, 10:45 and 10:51; for the selected Interval, the Interval View is contrasted with the expected Lineage over Job W1, Job W2, Job R, S1, S2 and S3]
Beyond The Interval View
• Instead of timestamps, use addresses of rows
• Structured Streaming (SS) has addresses (offsets) on each source, but not on sinks
• Most sinks are also sources and thus could return offsets
[Diagram: a Progress Event of a Job reports Source 2 offsets 3 - 5 and Source 3 offsets 12 - 14, while the Sink offsets are unknown (?)]
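To make the point about source offsets concrete, the sketch below reads the offset ranges out of a progress event serialized as JSON. The field names (`sources`, `startOffset`, `endOffset`, `sink`, `description`) follow the JSON form of Spark's `StreamingQueryProgress`; the sample event itself is hand-written for illustration, with Kafka-style offsets, made-up topic names `source2` and `source3`, and a placeholder sink description. The helper names are hypothetical.

```python
import json

# Hand-written sample (an assumption): only the fields used below are included.
SAMPLE_PROGRESS = """
{
  "name": "job",
  "batchId": 7,
  "sources": [
    {"description": "KafkaV2[Subscribe[source2]]",
     "startOffset": {"source2": {"0": 3}},
     "endOffset": {"source2": {"0": 5}}},
    {"description": "KafkaV2[Subscribe[source3]]",
     "startOffset": {"source3": {"0": 12}},
     "endOffset": {"source3": {"0": 14}}}
  ],
  "sink": {"description": "illustrative sink description"}
}
"""


def source_offsets(progress_json):
    """Map (topic, partition) to the (start, end) offsets read in this batch.

    Assumes Kafka-style offsets and that both startOffset and endOffset are set.
    """
    progress = json.loads(progress_json)
    ranges = {}
    for source in progress["sources"]:
        start, end = source["startOffset"], source["endOffset"]
        for topic, partitions in end.items():
            for partition, end_offset in partitions.items():
                ranges[(topic, int(partition))] = (start[topic][partition], end_offset)
    return ranges


def sink_fields(progress_json):
    """List the fields reported for the sink; no offsets are among them here."""
    return sorted(json.loads(progress_json)["sink"])


print(source_offsets(SAMPLE_PROGRESS))
print(sink_fields(SAMPLE_PROGRESS))
```

The sources yield one offset range per topic partition, while the sink entry in this sample carries only a description, which is the gap the slide points at.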
Offset-Based Linking
• Jobs are linked when progress offsets overlap
• The timestamp of an offset doesn’t matter
[Diagram: offset axes for S1, S2 and S3 with progress events of Job R, Job W1, Job W2 and Job X and the offset ranges 3 - 5, 12 - 14, 9 - 19 and 22 - 27; starting from the selected S1, the resulting lineage contains Job W1, Job W2, Job R, S1, S2 and S3]
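The linking rule above (jobs are linked when their progress offsets overlap) can be sketched as a plain range-overlap check. This is an illustration under stated assumptions, not the proposed implementation: the `OffsetRange` and `Progress` types and the `upstream_jobs` function are hypothetical, offsets are treated as inclusive ranges, and sinks are assumed to report written offsets, which Spark did not do at the time of the talk. The ranges 3 - 5, 12 - 14, 9 - 19 and 22 - 27 are taken from the slide, but which job each belongs to, and the range written by Job W1, are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffsetRange:
    """An inclusive range of offsets on one endpoint (hypothetical model)."""
    endpoint: str
    start: int
    end: int

    def overlaps(self, other):
        return (self.endpoint == other.endpoint
                and self.start <= other.end
                and other.start <= self.end)


@dataclass(frozen=True)
class Progress:
    """Offsets a job read and wrote in one progress event (assumed shape)."""
    job: str
    read: tuple = ()
    written: tuple = ()


def upstream_jobs(reader, candidates):
    """Return the jobs whose written offsets overlap what the reader consumed."""
    return sorted({
        writer.job
        for writer in candidates
        for out in writer.written
        for inp in reader.read
        if out.overlaps(inp)
    })


job_r = Progress("Job R", read=(OffsetRange("S2", 3, 5), OffsetRange("S3", 12, 14)))
job_w1 = Progress("Job W1", written=(OffsetRange("S2", 1, 6),))    # assumed range
job_w2 = Progress("Job W2", written=(OffsetRange("S3", 9, 19),))
job_x = Progress("Job X", written=(OffsetRange("S2", 22, 27),))    # no overlap

print(upstream_jobs(job_r, [job_w1, job_w2, job_x]))
```

No timestamp appears anywhere in the check, so a delayed read or an early write does not change which jobs are linked.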
Conclusion
• Spline: data lineage tracking tool
• New support for Structured Streaming
• Demo POC: Interval View
• Proposed generalization: offset-based linking
Future Plans
• Release Interval View in Spline
• After changes to Spark:
– Offset-based linking for micro-batch streaming
– Continuous streaming support
• Support for dataset checkpoints
Questions
• Now is a good time
• Or feel free to contact us
– Marek Novotny
• mn.mikke@gmail.com
– Vaclav Kosar
• admin@vaclavkosar.com
• github.com/AbsaOSS/spline
