Data lineage tracking is a significant problem for companies in highly regulated industries. To comply with strict regulatory frameworks, these companies must understand how data flows through their systems. Many of them also use big and fast data technologies such as Hadoop, Apache Spark and Kafka. Spark has become one of the most popular engines for big data computing, and recent releases add the Structured Streaming component, which allows real-time analysis and processing of data streamed from many sources. Spline is a data lineage tracking and visualization tool for Apache Spark. It captures and stores lineage information from internal Spark execution plans in a lightweight, unobtrusive and easy-to-use manner.
Additionally, Spline offers a modern user interface that allows non-technical users to understand the logic of Apache Spark applications. In this presentation we cover Spline's support for Structured Streaming and demonstrate how data lineage can be captured for streaming applications.
Presented at Spark Summit London 2018
2. About Us
• ABSA is a Pan-African financial services provider
– With Apache Spark at the core of its data engineering
• We try to fill gaps in the Hadoop ecosystem
• Contributions to Apache Spark
• Spark-related open-source projects (github.com/AbsaOSS)
– ABRiS – Avro SerDe for structured APIs (#SAISDev5)
– Cobrix – COBOL data source
– Atum – Completeness and accuracy library
– Spline – Data lineage tracking and visualization tool (#EUent3)
2#SAISExp18
3. • How is data calculated?
• What is the schema and format of streamed data?
5. [Figure: graph of Job 1, Job 2 and Job 3 over Topic A, Topic B, Topic D, Topic Z and Path /…/abc. Panels: Transformations; Job 3 Details (Topic D, //path/, Join, colA + colB, Topic Z).]
6. Schemas and Formats
[Figure: graph of Job 1, Job 2 and Job 3 over Topic A, Topic B, Topic D, Topic Z and Path /…/abc, annotated with Schema A, Schema B, Schema C, Schema D and Schema Z.]
10. Spline
• To make Spark BCBS (Clarity) compliant
• To communicate with business people
• Online documentation of
– Job dependencies
– Spark SQL job details
– Attributes occurring in the logic
[Figure panels: Dependencies, Details, Attributes]
11. Lineage Tracking of Batch Jobs
• Dataset-oriented
• Leverages execution plans
• Structured APIs only
– SQL
– DataFrames
– Datasets
• UDFs and lambdas are treated as black boxes
[Figure: Job, Dataset A, Lineage A]
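As a rough illustration of what "leverages execution plans" can mean, the sketch below walks a toy operator tree from the write node down to its read leaves and records what it finds. The `PlanNode` class, the operator names, the `my_udf` expression and the paths are hypothetical stand-ins, not Spark's or Spline's actual data structures. The UDF appears only as an opaque expression string, mirroring the black-box treatment described above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PlanNode:
    """Hypothetical stand-in for one operator of an execution plan."""
    op: str                       # operator name, e.g. "Read", "Join", "Write"
    detail: str = ""              # path or expression attached to the operator
    children: List["PlanNode"] = field(default_factory=list)


def collect_sources(node: PlanNode) -> List[str]:
    """Return the detail of every Read leaf at or below the given node."""
    if node.op == "Read":
        return [node.detail]
    sources: List[str] = []
    for child in node.children:
        sources.extend(collect_sources(child))
    return sources


def collect_operations(node: PlanNode) -> List[str]:
    """Return operator names in pre-order (parent before children)."""
    ops = [node.op]
    for child in node.children:
        ops.extend(collect_operations(child))
    return ops


def lineage_of(write: PlanNode) -> Dict[str, object]:
    """Summarize one write: its output, its inputs, and the operators involved."""
    return {
        "output": write.detail,
        "inputs": collect_sources(write),
        "operations": collect_operations(write),
    }


# Toy plan: two inputs are joined, a column is derived, the result is written.
plan = PlanNode("Write", "/data/out", [
    PlanNode("Project", "my_udf(colA, colB)", [   # the UDF body is not inspected
        PlanNode("Join", "a.id = b.id", [
            PlanNode("Read", "/data/a"),
            PlanNode("Read", "/data/b"),
        ]),
    ]),
])

lineage = lineage_of(plan)
print(lineage)
```

The result is keyed by the written dataset, which is one way to read "dataset-oriented": each write yields one lineage record.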
12. Lineage Tracking of Streaming Jobs
• Structured Streaming only
• Source-oriented (topic)
• Evolves in time
[Figure: App, Topic A, and a Time axis with Lineage T1, Lineage T2 and Lineage T3]
33. Interval View Limitations
[Figure: timeline (Start, End, Time) with progress events of Job W1, Job R and Job W2 and a selected Interval; nodes Job R, S1, S2, S3; timestamps 10:21, 10:25, 10:30, 10:35, 10:45, 10:51]
34. Interval View Limitations
[Figure: timeline (Start, End, Time) with progress events of Job W1, Job R and Job W2 and a selected Interval; Interval View panel with Job W1, Job R, S1, S2; timestamps 10:21, 10:25, 10:30, 10:35, 10:45, 10:51]
35. Interval View Limitations
• Edge case (delayed read, early write)
– Job W1 should be linked
– Job W2 should not be linked
[Figure: timeline (Start, End, Time) with progress events of Job W1, Job R and Job W2 and a selected Interval; panels Lineage and Interval View; node groups Job W1, Job R, S1, S2 and Job W2, Job R, S1, S3; timestamps 10:21, 10:25, 10:30, 10:35, 10:45, 10:51]
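The edge case can be reproduced with a small sketch. It assumes one plausible reading of the interval view: jobs are linked when they report progress on the same endpoint inside the selected interval. The `Progress` record, the `interval_links` function, and the assignment of times to jobs are all hypothetical; they are not taken from Spline.

```python
from dataclasses import dataclass
from datetime import time
from typing import FrozenSet, Iterable, Set, Tuple


@dataclass(frozen=True)
class Progress:
    """Hypothetical, simplified progress event of one streaming job."""
    job: str
    at: time
    reads: FrozenSet[str] = frozenset()
    writes: FrozenSet[str] = frozenset()


def interval_links(events: Iterable[Progress],
                   start: time, end: time) -> Set[Tuple[str, str, str]]:
    """Link writer -> endpoint -> reader when both events fall in [start, end]."""
    inside = [e for e in events if start <= e.at <= end]
    links: Set[Tuple[str, str, str]] = set()
    for writer in inside:
        for reader in inside:
            if writer.job == reader.job:
                continue
            for endpoint in writer.writes & reader.reads:
                links.add((writer.job, endpoint, reader.job))
    return links


# Assumed scenario: Job R reads late what Job W1 wrote earlier (delayed read),
# while Job W2 writes rows that Job R has not consumed yet (early write).
events = [
    Progress("Job W1", time(10, 21), writes=frozenset({"S1"})),
    Progress("Job R", time(10, 30), reads=frozenset({"S1"}),
             writes=frozenset({"S2"})),
    Progress("Job W2", time(10, 35), writes=frozenset({"S1"})),
]

links = interval_links(events, time(10, 28), time(10, 40))
print(links)
```

Under these assumptions the timestamp-based view links Job W2 to Job R and misses Job W1, which is the opposite of what the bullets above ask for.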
36. Beyond The Interval View
• Instead of timestamps, use addresses of rows
• Structured Streaming has addresses (offsets) on each source, but not on sinks
• Most sinks are also sources and thus could return offsets
[Figure: Progress Event of a Job with Source 2 (offsets 3 - 5), Source 3 (offsets 12 - 14) and a Sink (offsets ?)]
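A minimal sketch of the proposed offset-based linking follows. It assumes that sinks could report the offsets they wrote, which the slide notes they do not do today. The `OffsetRange` record, the job names and all offset values are hypothetical.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class OffsetRange(NamedTuple):
    """Hypothetical record: a job touched offsets first..last (inclusive)."""
    job: str
    endpoint: str
    first: int
    last: int


def overlaps(a: OffsetRange, b: OffsetRange) -> bool:
    """True when both ranges are on the same endpoint and share an offset."""
    return a.endpoint == b.endpoint and a.first <= b.last and b.first <= a.last


def offset_links(written: Iterable[OffsetRange],
                 read: Iterable[OffsetRange]) -> Set[Tuple[str, str, str]]:
    """Link writer -> endpoint -> reader when the reader consumed written rows."""
    readers = list(read)
    return {
        (w.job, w.endpoint, r.job)
        for w in written
        for r in readers
        if w.job != r.job and overlaps(w, r)
    }


# Assumed sink-side offsets; real sinks do not expose these in progress events.
written = [
    OffsetRange("Job W1", "S1", 3, 5),
    OffsetRange("Job W2", "S1", 6, 8),
]
# Source-side offsets of the kind a reader's progress event already carries.
read = [OffsetRange("Job R", "S1", 3, 5)]

links = offset_links(written, read)
print(links)
```

Because the link is decided by which rows were consumed rather than by when the events were reported, the delayed-read and early-write cases no longer produce a wrong link in this toy scenario.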
43. Conclusion
• Spline: data lineage tracking tool
• New support for Structured Streaming
• Demo POC: Interval View
• Proposed generalization: offset-based linking
44. Future Plans
• Release Interval View in Spline
• After changes to Spark:
– Offset-based linking for micro-batch streaming
– Continuous streaming support
• Support for dataset checkpoints
45. Questions
• Now is a good time
• Or feel free to contact us
– Marek Novotny
• mn.mikke@gmail.com
– Vaclav Kosar
• admin@vaclavkosar.com
• github.com/AbsaOSS/spline