Jan Scherbaum & Marek Novotny
Barclays Africa Group Limited
SPLINE:
APACHE SPARK LINEAGE, NOT ONLY
FOR THE BANKING INDUSTRY
#EUent3
7.34
Stuff happens
under the hood…
Data type changes
Adding new data
Complex calcs
7.34
Overview
• Barclays Africa
• Spline
• Live demo
• Future work
• The big picture
• Questions
Barclays Africa Group Limited
• Pan-African financial services provider
  – Providing services to South Africa (ABSA) and 11 other African countries (Barclays)
• Subject to strict regulatory compliance
  – Basel Committee on Banking Supervision (BCBS)
    • Accuracy
    • Comprehensiveness
    • Clarity
    • Usefulness
Spline
• SPark LINEage
• Open source project
• Goals
  – Satisfy initial interpretations of regulatory requirements, specifically on “Clarity”
    • Data lineage from Spark’s execution plans
    • Visualize in an “explorable” user-friendly format
Spline – How it works
[Diagram: Spark job. Labels: Spark library, Spark Session, Transformations, Action, Generate execution plans]
Spline – How it works
[Diagram: Spark Job & Spline. Labels: Spark library, Spark Session, Transformations, Action, Spline]
• 1 line initialization
• Use SQLContext listeners
• Generate execution plans
Spline – How it works
[Diagram: Spark Job & Spline, same labels as on the previous slide, with the Spline UI added]
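To make "data lineage from execution plans" more concrete, here is a minimal sketch in plain Scala. It is a toy model, not Spline's or Spark's actual code: the `Plan`, `Read`, `Transform` and `Write` types and the dataset names are hypothetical stand-ins for the nodes of an execution plan. It only shows that once a plan is available as a tree, the inputs behind an output can be found by walking that tree.

```scala
object LineageSketch {
  // Hypothetical, simplified stand-ins for execution plan nodes.
  // These are NOT Spark's or Spline's classes.
  sealed trait Plan
  final case class Read(source: String) extends Plan
  final case class Transform(description: String, inputs: Seq[Plan]) extends Plan
  final case class Write(target: String, input: Plan) extends Plan

  // Walk the plan tree and collect every data source that feeds the result.
  def sources(plan: Plan): Set[String] = plan match {
    case Read(source)         => Set(source)
    case Transform(_, inputs) => inputs.flatMap(sources).toSet
    case Write(_, input)      => sources(input)
  }

  // Render the plan from the output back to its inputs, one step per line,
  // indented by depth so the structure stays explorable.
  def describe(plan: Plan, depth: Int = 0): Seq[String] = {
    val pad = "  " * depth
    plan match {
      case Read(source) =>
        Seq(s"${pad}read $source")
      case Transform(description, inputs) =>
        s"$pad$description" +: inputs.flatMap(describe(_, depth + 1))
      case Write(target, input) =>
        s"${pad}write $target" +: describe(input, depth + 1)
    }
  }

  // Hypothetical plan loosely modelled on the demo use case (names are made up).
  val demoPlan: Plan =
    Write("beer_per_capita",
      Transform("divide consumption by population", Seq(
        Transform("join on country", Seq(
          Read("beer_consumption"),
          Read("development_indicators"))))))

  def main(args: Array[String]): Unit = {
    describe(demoPlan).foreach(println)
    println(s"Sources: ${sources(demoPlan).mkString(", ")}")
  }
}
```

A real execution plan carries far more detail (expressions, schemas, data types), but the traversal idea is the same.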
Demo use case
• Find the countries with the highest annual beer consumption per person
  – Correlation with GDP??
Data
Beer consumption per country
Country         2011        2010        2009
Czech Republic  15,583,000  15,549,000  16,190,000
Ireland         4,721,000   4,814,000   4,832,000
Development indicators from the World Bank
Country         Metric      2011        2010        2009
Czech Republic  GDP         $21,717     $19,764     $19,698
Czech Republic  Population  10,496,088  10,474,410  10,443,936
Ireland         GDP         $52,567     $48,538     $51,983
Ireland         Population  4,576,794   4,560,155   4,535,375
Analysis
• Marek’s job
  – Data prep
  – Analyze the correlation between beer consumption and GDP growth
• Jan’s beer job
  – Calculate the consumption of beer per country per capita
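The two calculations named above can be sketched in plain Scala, without Spark, using only the six rows shown on the Data slide. This is an illustration under assumptions, not the code from the live demo: the `CountryYear` type and the helper names are made up here, the slide does not state the unit of the beer figures, and pairing per-capita consumption with year-over-year GDP growth is just one possible reading of "correlation between beer consumption and GDP growth".

```scala
object BeerAnalysis {
  // One row per country and year; figures copied from the Data slide.
  // The slide does not state the unit of the beer consumption figures.
  final case class CountryYear(country: String, year: Int,
                               beer: Double, gdp: Double, population: Double)

  val data: Seq[CountryYear] = Seq(
    CountryYear("Czech Republic", 2009, 16190000, 19698, 10443936),
    CountryYear("Czech Republic", 2010, 15549000, 19764, 10474410),
    CountryYear("Czech Republic", 2011, 15583000, 21717, 10496088),
    CountryYear("Ireland",        2009,  4832000, 51983,  4535375),
    CountryYear("Ireland",        2010,  4814000, 48538,  4560155),
    CountryYear("Ireland",        2011,  4721000, 52567,  4576794)
  )

  // Beer consumption per person for a single country and year.
  def perCapita(row: CountryYear): Double = row.beer / row.population

  // Year-over-year GDP growth (as a fraction) for each country,
  // paired with the row of the later year.
  def gdpGrowth(rows: Seq[CountryYear]): Seq[(CountryYear, Double)] =
    rows.groupBy(_.country).values.toSeq.flatMap { countryRows =>
      val sorted = countryRows.sortBy(_.year)
      sorted.zip(sorted.tail).map { case (prev, cur) => (cur, cur.gdp / prev.gdp - 1.0) }
    }

  // Pearson correlation coefficient of two equally long samples.
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size && xs.size > 1, "need two samples of equal size > 1")
    val meanX = xs.sum / xs.size
    val meanY = ys.sum / ys.size
    val cov = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val devX = math.sqrt(xs.map(x => (x - meanX) * (x - meanX)).sum)
    val devY = math.sqrt(ys.map(y => (y - meanY) * (y - meanY)).sum)
    cov / (devX * devY)
  }

  def main(args: Array[String]): Unit = {
    // Rank countries by per-capita consumption in 2011, highest first.
    data.filter(_.year == 2011).sortBy(row => -perCapita(row)).foreach { row =>
      println(f"${row.country}%-15s ${perCapita(row)}%.3f")
    }
    // Correlate per-capita consumption with GDP growth over the available years.
    val growth = gdpGrowth(data)
    val r = pearson(growth.map { case (row, _) => perCapita(row) }, growth.map(_._2))
    println(f"Pearson r (per-capita consumption vs GDP growth): $r%.3f")
  }
}
```

With only two countries and three years, the coefficient says nothing statistically meaningful; the sketch is only meant to show the shape of the computation whose lineage the demo tracks.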
Dependencies
<dependency>
  <groupId>za.co.absa</groupId>
  <artifactId>spline-core</artifactId>
  <version>${spline.version}</version>
</dependency>
<dependency>
  <groupId>za.co.absa</groupId>
  <artifactId>spline-persistence-mongo</artifactId>
  <version>${spline.version}</version>
</dependency>
Initialization
// Initializing library to hook up to Apache Spark
import za.co.absa.spline.core.SparkLineageInitializer._
spark.enableLineageTracking()
Next steps
• Enterprise features
  – Authentication (Kerberos, SSO)
  – Authorization
  – User management
• Interoperability with other tools
  – Cloudera Manager, Informatica, Apache Atlas
• Support other Spark data sources and actions
  – Streaming, ML
The bigger picture
• Develop open source conformance & ingestion engine on Spark
  – BCBS compliant (lineage, dataflow controls, error tracking)
  – Transfer & transform data from different source systems
    • To strongly typed datasets
    • On Hadoop
    • Conforming to enterprise level data dictionaries & data quality
• In development – stay tuned :)
We’re open source!
• Contributions are most welcome
• Released versions mirrored on
  – https://github.com/AbsaOSS/spline
• Wiki and docs on
  – https://absaoss.github.io/spline/
Questions
• Now is a good time
• Or feel free to contact us
  – Jan Scherbaum
    • jan.scherbaum@barclays.com
  – Marek Novotny
    • marek.x.novotny@barclays.com
  – Oleksandr Vayda
    • oleksandr.vayda@barclays.com
• Acknowledgements:
  – Dennis Chu, Aaisha Bibi Osman, Adam Smyczek, Andrew Baker
