Growing a better world together
Rabobank Group
DataWorks Summit, Berlin, 2018
Introduction
Jeroen Wolffensperger
Solution Architect Data
jeroen.wolffensperger@rabobank.nl
1996, 2007
Introduction
Martijn Groen MSc, PMP
Delivery manager Data Lake & Delivery Distribution (Client Data)
Rabobank Netherlands HQ
martijn.groen@rabobank.nl
1995, 1996, 2007, 2010, 2015
Source: https://www.iea.org/newsroom/
What is needed to create all these new business models?
Lots of:
• Data
• Formats
• Speed
Data Architecture
Data product types & Development styles
[Figure: Damhof Quadrant]
• Axis, development style: Systematic vs. Opportunistic
• Axis, push-pull point: Push/Supply/Source driven vs. Pull/Demand/Product driven
• Quadrants: Raw & Defined data; Information Product; Ad-Hoc Data; R&D / Analytics
• Other labels: Operationalize; Process; Dataflow; Data scientists, analysts; End users
Source: Damhof Quadrant
Data Architecture
Data product types & Development styles
[Figure: the same quadrant (Raw & Defined data; Information Product; Ad-Hoc Data; R&D / Analytics; Systematic vs. Opportunistic; Push/Supply/Source driven vs. Pull/Demand/Product driven), overlaid with three areas: Data Lake, Data Factory, Data Lab]
Data Architecture
[Figure: architecture overview]
• Sources: Data Domains; External Data
• Data Lake
• Data Factory: Definition Factory; Information Factory
• Data Lab
• Provisioning: Batch services; Real-time services
• Business value: Business Intelligence; Analytics; Marketing; On-line Services; Real-time relevance
• Data Management (Data Governance, Data Lineage, Data Quality, Metadata Management, Data Catalog, Data Security)
• Monitoring (Infrastructure, Data usage)
Data Architecture building blocks
• Based on a manufacturing production process.
• Each building block is replaceable.
Building blocks: Data Logistics; Data Storage; Meta Data Storage; Data Refinery; Data Provisioning
Functions: Transport; Compute; Catalog; Provide; Secure; Store; Resource management; Monitor; Import; Export
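The "each building block is replaceable" principle can be illustrated with a minimal Python sketch. Everything named here (`DataStorage`, `InMemoryStorage`, `AuditedStorage`, `transport`) is hypothetical and not taken from the Rabobank implementation; the point is only that the logistics step depends on a contract rather than on one concrete storage product.

```python
from __future__ import annotations

from typing import Iterable, Protocol


class DataStorage(Protocol):
    """Hypothetical contract that every storage building block must satisfy."""

    def store(self, key: str, record: dict) -> None: ...

    def read(self, key: str) -> dict: ...


class InMemoryStorage:
    """Stand-in storage block; a real one might wrap HDFS or another store."""

    def __init__(self) -> None:
        self._records: dict[str, dict] = {}

    def store(self, key: str, record: dict) -> None:
        self._records[key] = record

    def read(self, key: str) -> dict:
        return self._records[key]


class AuditedStorage:
    """Alternative block with the same contract that also logs each write."""

    def __init__(self) -> None:
        self._records: dict[str, dict] = {}
        self.audit_log: list[str] = []

    def store(self, key: str, record: dict) -> None:
        self.audit_log.append(key)
        self._records[key] = record

    def read(self, key: str) -> dict:
        return self._records[key]


def transport(records: Iterable[tuple[str, dict]], storage: DataStorage) -> int:
    """Logistics step: move records into whichever storage block is plugged in."""
    count = 0
    for key, record in records:
        storage.store(key, record)
        count += 1
    return count
```

Because `transport` only relies on the `store`/`read` contract, swapping `InMemoryStorage` for `AuditedStorage` requires no change to the logistics code.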
Data Architecture: technology
Kafka: https://www.datanami.com/2017/08/15/kafka-helped-rabobank-modernize-alerting-system/
Data Logistics
Big Data Management
Why did we choose HDF – NiFi?
January 2017
• We compared NiFi, Informatica Intelligent Streaming and StreamSets Data Collector.
• NiFi has an open architecture, making it easy to create your own connectors.
• NiFi has the most functionality and is easy to use.
• NiFi has the biggest user base and a very active community.
• Works well in combination with Cloudera.
• No data lineage or template deployment support yet, but both are on the roadmap (release 3.2).
• Informatica's first release of Intelligent Streaming* was in December 2016. The product was not yet mature enough.
• StreamSets runs 100% in memory, whereas NiFi writes to disk. In our opinion StreamSets is less mature than NiFi.
* Renamed to Big Data Streaming in January 2018
Data Architecture: technology
HDFS
Data Storage
Data Architecture: technology
Data Refinery
Big Data Management
Data Architecture: technology
Data Provisioning
Data Architecture: technology
Data Governance
Enterprise Data Catalog
Navigator
Big Data Management
Business case: Bedrijfskompas (Company Compass)
• We deliver insight into your financial position and a benchmark against the performance of other companies in your own industry or sector.
• We do this via:
  • An online dashboard with a graphical presentation of your liquidity.
  • A display of your company's performance compared to aggregated benchmark data of peers from the sector.
• We implemented the liquidity dashboard first; it is currently accessible as a standalone visual in our internet banking environment.
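As a rough illustration of the two deliverables above, the sketch below derives a liquidity curve from transactions and compares one company's figure with an aggregated peer value. The function names, the use of a running balance as the liquidity measure and the choice of the median as the aggregate are all assumptions made for this example, not details from the talk.

```python
from statistics import median


def liquidity_series(opening_balance, transactions):
    """Running balance after each (iso_date, amount) transaction, in date order.

    Assumption: liquidity is approximated by the account balance over time.
    """
    balance = opening_balance
    series = []
    for day, amount in sorted(transactions):
        balance += amount
        series.append((day, balance))
    return series


def difference_from_benchmark(company_value, peer_values):
    """Company value minus the sector median (an assumed aggregate)."""
    return company_value - median(peer_values)


# Hypothetical sample data, for illustration only.
series = liquidity_series(
    1000.0,
    [("2018-01-03", -200.0), ("2018-01-02", 500.0)],
)
latest_balance = series[-1][1]
gap = difference_from_benchmark(latest_balance, [1000.0, 1200.0, 2000.0])
```

Only the aggregate of the peer values is used in the comparison, which mirrors the slide's point that individual peers are shown as benchmark data rather than as separate companies.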
Liquidity dashboard
[Figure: delivery timeline]
• Releases: Concept; Growth Hack Prototype; Initial F&F Release; Final F&F Release; Pilot Bank Release; Full scale release
• Milestones: Data Lab; Start Data Lake; Connection real-time transaction data; Security; First API specification; First API endpoint live; Full API live; Performance tuning; OpenShift; Big Data Cluster; Start Front-End team
Some figures
• Business case implemented in 8 months, including the initial set-up of infrastructure and security.
• Hortonworks DataFlow (3.0.2):
  • Able to process 100,000 events per second; 0.6 GB per second.
  • Initial load: 25 billion payment transactions; 7 years of history loaded in 7 hours.
  • NRT (near-real-time) load: an average of 15 million transactions per day.
• Current average response time of an API call: < 100 ms.
• Initial set-up costs are recouped through other business cases that make use of the infrastructure.
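To put these figures in perspective, the arithmetic below derives the implied averages. It assumes decimal gigabytes (1 GB = 10^9 bytes) and an evenly spread load; the helper names are made up for this sketch.

```python
SECONDS_PER_HOUR = 3_600
SECONDS_PER_DAY = 86_400


def average_event_size_bytes(bytes_per_sec, events_per_sec):
    """Average payload per event implied by a byte rate and an event rate."""
    return bytes_per_sec / events_per_sec


def average_rate_per_sec(count, seconds):
    """Average items per second when `count` items are spread over `seconds`."""
    return count / seconds


# 0.6 GB/s at 100,000 events/s (decimal GB assumed).
event_size = average_event_size_bytes(0.6e9, 100_000)

# 25 billion transactions loaded in 7 hours.
initial_load_rate = average_rate_per_sec(25e9, 7 * SECONDS_PER_HOUR)

# 15 million transactions per day in near-real-time.
nrt_rate = average_rate_per_sec(15e6, SECONDS_PER_DAY)

print(f"average event size: {event_size:.0f} bytes")
print(f"initial load: {initial_load_rate:,.0f} transactions/s")
print(f"NRT load: {nrt_rate:.1f} transactions/s")
```

Under these assumptions an event averages about 6 KB, the initial load averaged roughly 990,000 transactions per second, and the daily near-real-time load averages about 174 transactions per second.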
Key takeaways
• Fail fast: an experimental approach quickly shows whether a technology fits within the overall data architecture.
• Every technology component must be replaceable, because earlier choices can prove worse than expected and Hadoop technologies change fast.
• Hire (professional services) expertise for securing your cluster. Kerberos is a headache but necessary. We thought we had secured everything; we had not.
• Stay in control of the data provisioned via APIs.
• Data Governance is key to keeping an overview of your Data Lake and to complying with regulations such as GDPR and BCBS 239. A good Data Catalog is a must.
Thank you for your attention!

From an experiment to a real production environment
