Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Novotny and Jan Scherbaum
The document discusses the usage of Apache Spark lineage in Barclays Africa Group Limited to meet regulatory compliance and enhance data clarity. It introduces the Spline tool that enables visualization of Spark execution plans, emphasizing its functionality and future development plans. The presentation includes a demonstration of a use case analyzing beer consumption per capita in correlation with GDP across various countries.
• Pan-African financial services provider
– Providing services to South Africa (ABSA) and 11 other African countries (Barclays)
• Subject to strict regulatory compliance
– Basel Committee on Banking Supervision (BCBS)
• Accuracy
• Comprehensiveness
• Clarity
• Usefulness
5#EUent3
Barclays Africa Group Limited
• SPark LINEage
• Open source project
• Goals
– Satisfy initial interpretations of regulatory
requirements, specifically on “Clarity”
• Data lineage from Spark’s execution plans
• Visualize in an “explorable” user-friendly format
Spline
Spline – How it works
Spark job
1 line initialization
Use SQLContext listeners
Generate execution plans
Spark Job & Spline
Spark library
Spark Session
Action
Spline
Transformations
Spline UI
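The flow on this slide (one initialization line, a listener on the session, a plan captured when an action runs) can be pictured with a toy model. This is a minimal sketch in plain Python, not Spark or Spline code: `ToySession`, `ToyFrame`, `PlanNode`, `enable_toy_lineage` and the file names are all hypothetical names invented for illustration, and Spline's real entry point is not shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlanNode:
    """One step of a toy execution plan; `child` is the step it reads from."""
    op: str
    detail: str
    child: Optional["PlanNode"] = None


class ToySession:
    """Hypothetical stand-in for a Spark session (not a Spark or Spline API)."""

    def __init__(self):
        self.listeners = []

    def read(self, path):
        return ToyFrame(self, PlanNode("read", path))


class ToyFrame:
    def __init__(self, session, plan):
        self.session = session
        self.plan = plan

    def transform(self, op, detail):
        # Transformations are lazy: they only extend the plan.
        return ToyFrame(self.session, PlanNode(op, detail, self.plan))

    def write(self, path):
        # An action completes the plan and notifies every registered listener.
        final_plan = PlanNode("write", path, self.plan)
        for listener in self.session.listeners:
            listener(final_plan)


def enable_toy_lineage(session, sink):
    """The single line a job adds: register a listener that records lineage."""

    def on_action(plan):
        steps = []
        node = plan
        while node is not None:
            steps.append(f"{node.op}({node.detail})")
            node = node.child
        # Walk order is sink-to-source, so reverse it to read source-to-sink.
        sink.append(list(reversed(steps)))

    session.listeners.append(on_action)


session = ToySession()
captured = []
enable_toy_lineage(session, captured)  # the "1 line initialization"

(session.read("beer.csv")
    .transform("filter", "year = 2011")
    .transform("select", "country, consumption")
    .write("result.parquet"))

print(captured[0])
```

The job code itself is unchanged apart from the initialization call; the lineage is derived entirely from the plan handed to the listener when the action fires.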
Demo use case
• Find the countries with the highest annual beer consumption per person
– Correlation with GDP??
Data
Beer consumption per country
Country          2011         2010         2009
Czech Republic   15,583,000   15,549,000   16,190,000
Ireland          4,721,000    4,814,000    4,832,000

Development indicators from the World Bank
Country          Metric       2011         2010         2009
Czech Republic   GDP          $21,717      $19,764      $19,698
Czech Republic   Population   10,496,088   10,474,410   10,443,936
Ireland          GDP          $52,567      $48,538      $51,983
Ireland          Population   4,576,794    4,560,155    4,535,375
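The two inputs have different shapes: one row per country for beer consumption, and one row per country and metric for the indicators. A data prep step would typically bring them to a common key first. The following is a minimal sketch in plain Python using only the rows shown on this slide; the helper `parse_number` and the record layout are assumptions made for illustration, not the presenters' actual job.

```python
YEARS = ("2011", "2010", "2009")

# Rows copied from the slide, kept as formatted strings on purpose.
beer_rows = [
    ("Czech Republic", "15,583,000", "15,549,000", "16,190,000"),
    ("Ireland", "4,721,000", "4,814,000", "4,832,000"),
]
indicator_rows = [
    ("Czech Republic", "GDP", "$21,717", "$19,764", "$19,698"),
    ("Czech Republic", "Population", "10,496,088", "10,474,410", "10,443,936"),
    ("Ireland", "GDP", "$52,567", "$48,538", "$51,983"),
    ("Ireland", "Population", "4,576,794", "4,560,155", "4,535,375"),
]


def parse_number(text):
    """Turn '15,583,000' or '$21,717' into an int (hypothetical helper)."""
    return int(text.replace("$", "").replace(",", ""))


# One record per (country, year), holding beer, gdp and population together.
records = {}
for country, *values in beer_rows:
    for year, value in zip(YEARS, values):
        records.setdefault((country, year), {})["beer"] = parse_number(value)
for country, metric, *values in indicator_rows:
    for year, value in zip(YEARS, values):
        records.setdefault((country, year), {})[metric.lower()] = parse_number(value)

for key in sorted(records):
    print(key, records[key])
```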
Analysis
• Marek’s job
– Data prep
– Analyze the correlation between beer consumption and GDP growth
• Jan’s beer job
– Calculate the consumption of beer per country per capita
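The arithmetic behind the two jobs can be sketched on the sample rows from the Data slide. This is an illustrative sketch in plain Python, not the demo's Spark code. It makes two assumptions: the slide gives no unit for the beer figures, so the per-capita ratio is left unit-agnostic, and the correlation is taken against the GDP values as listed (GDP growth would use year-over-year differences instead). With only two countries the coefficient shows the mechanics, not a conclusion.

```python
from math import sqrt

# (country, year) -> (beer consumption, population, GDP), copied from the Data slide.
DATA = {
    ("Czech Republic", 2011): (15_583_000, 10_496_088, 21_717),
    ("Czech Republic", 2010): (15_549_000, 10_474_410, 19_764),
    ("Czech Republic", 2009): (16_190_000, 10_443_936, 19_698),
    ("Ireland", 2011): (4_721_000, 4_576_794, 52_567),
    ("Ireland", 2010): (4_814_000, 4_560_155, 48_538),
    ("Ireland", 2009): (4_832_000, 4_535_375, 51_983),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Beer job: consumption per country per capita.
per_capita = {
    key: beer / population for key, (beer, population, _gdp) in DATA.items()
}

# Countries ordered by per-capita consumption in 2011, highest first.
ranking_2011 = sorted(
    (country for country, year in DATA if year == 2011),
    key=lambda country: per_capita[(country, 2011)],
    reverse=True,
)

# Correlation job: GDP against per-capita consumption over all rows.
keys = list(DATA)
r = pearson([DATA[k][2] for k in keys], [per_capita[k] for k in keys])

print(ranking_2011)
print(f"Pearson r (GDP vs. per-capita consumption): {r:.2f}")
```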
Next steps
• Enterprise features
– Authentication (Kerberos, SSO)
– Authorization
– User management
• Interoperability with other tools
– Cloudera Manager, Informatica, Apache Atlas
• Support other Spark data sources and actions
– Streaming, ML
The bigger picture
• Develop open source conformance & ingestion engine on Spark
– BCBS compliant (lineage, dataflow controls, error tracking)
– Transfer & transform data from different source systems
• To strongly typed datasets
• On Hadoop
• Conforming to enterprise level data dictionaries & data quality
• In development – stay tuned :)
We’re open source!
• Contributions are most welcome
• Released versions mirrored on
– https://github.com/AbsaOSS/spline
• Wiki and docs on
– https://absaoss.github.io/spline/
Questions
• Now is a good time
• Or feel free to contact us
– Jan Scherbaum
• jan.scherbaum@barclays.com
– Marek Novotny
• marek.x.novotny@barclays.com
– Oleksandr Vayda
• oleksandr.vayda@barclays.com
• Acknowledgements:
– Dennis Chu, Aaisha Bibi Osman, Adam Smyczek, Andrew Baker